5 More rules with Decision Trees

We made progress with OneR which makes predictions on one feature of the dataset. The next logical step is to make predictions on combinations of features. This is what a decision tree (Breiman et al. 1984) does; it repeatedly splits the data into groups based on the values of different features. When making a prediction it finds the most appropriate group for the input and returns the most common label from that group. Our work with decision trees will span multiple chapters as we’ll explore some important topics in machine learning using this model.

5.1 Easy button

We’ll use bag of characters feature extraction like we did last chapter and feed that to the decision tree.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from nlpbook import get_train_test_data

# Grab the data and extract the features and labels.
train, test = get_train_test_data()
features = "review"
label = "label"
X, y = train[features], train[label]
X_test, y_test = test[features], test[label]

# Set up the pipeline.
boc = CountVectorizer(analyzer="char", lowercase=False)
model = DecisionTreeClassifier()
pipeline = Pipeline([("boc", boc), ("decision_tree", model)])

# Train it!
pipeline.fit(X, y)
# Score it!
pipeline.score(X_test, y_test)

0.5534486266531028

How does this compare to our previous models?

	Accuracy
Model
Baseline	0.501119
OneR (length)	0.502665
OneR (boc)	0.581282

It actually performed worse than OneR with bag of characters features. We have extracted features from the reviews and we’re using a better model, so in theory it should perform better. We’ll cover why this is happening in a future chapter where we explore feature importance and hyperparameter tuning.

5.2 Rolling our own

We’ll get straight to implementing it ourself. As the name implies a decision tree uses a binary tree structure. They operate on a simple premise, split the data into two groups based on the most predictive feature, then repeat on each subgroup, and so on.

So how do we split the data into two groups? Let’s start with the question mark feature as an example. We know reviews with question marks are likely to be negative and vice versa for reviews without question marks. This gives us two groups, those with question marks and those without.

# Grab the index of the question mark feature.
question_idx = boc.vocabulary_["?"]

# Transform the data to a bag of characters.
# Convert sparse matrix to numpy array as well.
X_boc = boc.transform(X).toarray()

# Get the indices of the group with question marks.
group1_idxs = X_boc[:, question_idx] > 0
# And those without.
group2_idxs = ~group1_idxs

Now we let’s create baseline classifiers for each of these groups and see what the accuracy is.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Baseline classifier for each group.
group1 = DummyClassifier().fit(X_boc[group1_idxs], y[group1_idxs])
group2 = DummyClassifier().fit(X_boc[group2_idxs], y[group2_idxs])

# Do unto the test data what you do to the train data.
X_test_boc = boc.transform(X_test).toarray()
group1_idxs_test = X_test_boc[:, question_idx] > 0
group2_idxs_test = ~group1_idxs_test

# Get the predictions for each group.
pred = np.zeros(len(y_test), dtype=int)
pred[group1_idxs_test] = group1.predict(X_test_boc[group1_idxs_test])
pred[group2_idxs_test] = group2.predict(X_test_boc[group2_idxs_test])
accuracy_score(y_test, pred)

0.581322482197355

OK, we basically get the OneR performance just by splitting the data into two groups on the question mark feature. Let’s write an algorithm that finds the best value to split the data on for any feature. With simple brute force we’ll iterate over all possibilities.

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.multiclass import unique_labels


class Split(ClassifierMixin, BaseEstimator):
    """Split the data on the feature value."""

    def __init__(self, idx, value):
        # Index of the feature matrix.
        self.idx = idx
        # Value to split the data on.
        self.value = value

    def fit(self, X, y):
        # Convert other data types to numpy array
        # for consistency.
        X, y = np.array(X), np.array(y)

        # Grab class labels.
        self.classes_ = unique_labels(y)

        # Create boolean arrays to split the groups on.
        rhs = X[:, self.idx] >= self.value
        lhs = ~rhs

        # Create baseline classifiers for each split.
        self.lhs_ = DummyClassifier().fit(X[lhs], y[lhs])
        self.rhs_ = DummyClassifier().fit(X[rhs], y[rhs])

        return self

    def predict(self, X):
        # Convert other data types to numpy array
        # for consistency.
        X = np.array(X)

        # Make our empty prediction array.
        pred = np.zeros(X.shape[0], dtype=int)

        # Create boolean arrays to split the groups on.
        rhs = X[:, self.idx] >= self.value
        lhs = ~rhs

        # Populate the prediction array with predictions from
        # each group.
        if lhs.sum() > 0:
            pred[lhs] = self.lhs_.predict(X[lhs])
        if rhs.sum() > 0:
            pred[rhs] = self.rhs_.predict(X[rhs])

        return pred


def find_best_split(X, y):
    """Iterate over all possible values in `X` to find the best
    split point."""
    # Convert other data types to numpy array
    # for consistency.
    X, y = np.array(X), np.array(y)

    # Variables for the two groups.
    best_split = best_score = None

    # Iterate over each feature.
    for i, column in enumerate(X.T):
        # Iterate over each unique value in column.
        for value in np.unique(column):
            try:
                split = Split(i, value).fit(X, y)
            except ValueError:
                # `DummyClassifier` will raise a `ValueError`
                # if it is trained on an empty dataset, in which
                # case we just skip this split.
                continue

            # Score the split on this value.
            score = split.score(X, y)

            # Keep this split if it has the best score so far.
            if best_score is None or score > best_score:
                best_split = split
                best_score = score

    # Raise an error if there is no way to split the data.
    if best_split is None:
        raise ValueError

    return best_split


# Find the best split on the training data.
split = find_best_split(boc.transform(X).toarray(), y)
# Score the best split.
split.score(boc.transform(X_test).toarray(), y_test)

0.581322482197355

And we get the same result! We now have an automated way to find the best split on a dataset. The next step is to further divide each group. We just repeat the process, further dividing the data. In order to split a group, we must make the split on the subset of the data that group was trained on. We trained two baseline classifiers with two subsets of the data, now we replace those baseline classifiers with Split models that are trained on those same subsets. Because the Split models are sklearn estimators we can replace the baseline models and they’ll just work when we call the appropriate methods.

Let’s start by splitting just the left hand side group. We need to get the appropriate subset of the training data and make a new split with find_best_split, replacing the baseline classifier that was trained on that same data.

# Transform our input to a bag of characters.
X_boc = boc.transform(X).toarray()

# Create a boolean array to get the left hand side group.
lhs = X_boc[:, split.idx] < split.value

# Grab the part of the dataset that was used to train the
# left hand side baseline classifier.
X_lhs = X_boc[lhs]
y_lhs = y[lhs]

# Create a new split for just the left hand side group,
# replacing the baseline classifier.
split.lhs_ = find_best_split(X_lhs, y_lhs)

# Score the new model.
split.score(boc.transform(X_test).toarray(), y_test)

0.5853916581892167

Progress. We now have two splits. The first splits the data into two groups, and the second splits the left hand side into two more groups giving three groups in total.

Let’s code the process up to find all the optimal splits over the whole dataset. There is one additional caveat I haven’t mentioned yet. We should only split the data if it improves the score, otherwise stop splitting the data and leave the baseline models in place.

from scipy.sparse import issparse


def find_best_splits(X, y):
    """Generate a binary tree based on the data."""
    # Create a baseline classifier for the entire dataset.
    unsplit = DummyClassifier().fit(X, y)
    try:
        # Create a split on the dataset.
        split = find_best_split(X, y)
    except ValueError:
        # If it's impossible to split the dataset, return
        # the baseline classifier.
        return unsplit

    # If the baseline classifier performs better than the
    # split classifier, return the baseline classifier.
    if unsplit.score(X, y) >= split.score(X, y):
        return unsplit

    # Create boolean arrays for each subset of the data based
    # on the split value.
    rhs = X[:, split.idx] >= split.value
    lhs = ~rhs

    # Recursively update the left hand side classifier.
    split.lhs_ = find_best_splits(X[lhs], y[lhs])
    # Recursively update the right hand side classifier.
    split.rhs_ = find_best_splits(X[rhs], y[rhs])

    # Return the updated split.
    return split


class DecisionTree(ClassifierMixin, BaseEstimator):
    """Binary decision tree classifier."""

    def fit(self, X, y):
        # Convert sparse matrix to `numpy` matrix.
        if issparse(X):
            X = X.toarray()
        # Convert `X` and `y` to `numpy` arrays for consistency.
        X, y = np.array(X), np.array(y)

        # Grab the labels.
        self.classes_ = unique_labels(y)

        # Create the binary tree.
        self.tree_ = find_best_splits(X, y)

        return self

    def predict(self, X):
        # Convert sparse matrix to `numpy` matrix.
        if issparse(X):
            X = X.toarray()
        # Convert `X` to `numpy` array for consistency.
        X = np.array(X)
        # Return predictions from the binary decision tree.
        return self.tree_.predict(X)

Keen eyed observers may have noticed this is an implementation of recursive depth first search.

Note

For those running the notebook, this implementation is very slow. It took over an hour to train on my laptop. Feel free to skip training. Check out the bonus chapter on optimization (TODO: write bonus chapter and add link) to see how to super charge scikit-learn models.

pipeline = Pipeline([("boc", boc), ("decision_tree", DecisionTree())])
pipeline.fit(X, y)
pipeline.score(X_test, y_test)

0.5877924720244151

Oh interesting, our decision tree actually performed better than the sklearn one! Turns out ours is a little different and we’ll find out why next chapter when we dive into loss functions.