7  Words > characters

So far we’ve been working with a bag of characters, which gave us a modest improvement in accuracy. You can see in the table below that using a bag of characters to represent the text improved the OneR model by about 8 percentage points, but switching to a decision tree barely improved the accuracy further.

Model                  Accuracy
Baseline               0.501119
OneR (length)          0.502665
OneR (boc)             0.581282
Decision Tree (boc)    0.587792

When I introduced the bag of characters I tried to drive home the point that inputs matter. We’ll make one tiny change to our inputs here, and after this chapter you won’t be able to deny it: all we’re doing is changing the bag from characters to words. Let’s go!

7.1 Easy button

When we learned about decision trees we used the CountVectorizer class to make our bag of characters. By changing its arguments it will make a bag of words instead.
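
To see the difference the arguments make, here’s a minimal sketch on a couple of made-up strings (not the review data). It assumes the earlier bag of characters was built with analyzer="char"; the default analyzer="word" splits on words instead.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ran"]

# Bag of characters: each feature is a single character.
boc = CountVectorizer(analyzer="char").fit(docs)
print(sorted(boc.vocabulary_))  # [' ', 'a', 'c', 'e', 'h', 'n', 'r', 's', 't']

# Bag of words (the default): each feature is a whole word.
bow = CountVectorizer().fit(docs)
print(sorted(bow.vocabulary_))  # ['cat', 'ran', 'sat', 'the']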

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from nlpbook import get_train_test_data

# Grab the data and extract the features and labels.
train, test = get_train_test_data()
features = "review"
label = "label"
X, y = train[features], train[label]
X_test, y_test = test[features], test[label]

# Set up the pipeline.
bow = CountVectorizer()  # <-- This is the only change!
model = DecisionTreeClassifier()
pipeline = Pipeline([("bow", bow), ("decision_tree", model)])

# Train it!
pipeline.fit(X, y)
# Score it!
pipeline.score(X_test, y_test)
0.7185350966429298

An accuracy of almost 72%! That’s a huge boost over the 59% we got with the bag of characters.

7.2 Rolling our own

There’s not much to it. The code will look almost identical to what we wrote in Section 4.7. The main difference is what we store the counts in. Since we’re looking at every word in the training set, the vocabulary ends up being large, and the resulting count matrix is too big to fit in a dense numpy array. We’ll use a scipy sparse matrix instead, which offers a space-efficient representation by storing only the nonzero values.
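
Here’s a tiny sketch with made-up numbers, just to show how csr_matrix is constructed from the nonzero values along with their row and column indices.

from scipy.sparse import csr_matrix

# Three nonzero entries: 2 at (0, 1), 1 at (1, 0) and 5 at (1, 2).
values = [2, 1, 5]
rows = [0, 1, 1]
cols = [1, 0, 2]
matrix = csr_matrix((values, (rows, cols)), shape=(2, 3))
print(matrix.toarray())
# [[0 2 0]
#  [1 0 5]]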

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, TransformerMixin


class BagOfWords(TransformerMixin, BaseEstimator):
    """Bag of words feature extractor."""

    def fit(self, X, y=None):
        """Fit on all characters in the array `X`.

        Note: `X` should be a 1d array.
        """
        # We want a 1d text array so we'll check its shape here.
        # While iterating over the array values we'll check
        # they are text while trying to extract words.
        assert len(X.shape) == 1

        vocabulary_ = {}
        # Iterate over each string in the array.
        for x in X:
            # Check it's a string!
            assert isinstance(x, str)

            # Get the unique words in the string.
            words = np.unique(x.split())

            # Add each word to the vocabulary if it isn't
            # there already.
            for word in words:
                if word not in vocabulary_:
                    vocabulary_[word] = len(vocabulary_)

        self.vocabulary_ = vocabulary_

        return self

    def transform(self, X):
        """Transform `X` to a count matrix.

        Note: `X` should be a 1d array.
        """
        # Run our own checks.
        assert len(X.shape) == 1

        # Create a matrix to hold the counts.
        # Due to the number of words in the vocabulary we need to use a
        # sparse matrix.
        # Sparse matrices are space efficient representations of matrices
        # that conserve space by not storing 0 values.
        # They are constructed a bit differently from `numpy` arrays.
        # We'll store the counts and their expected row, col indices in
        # lists that `csr_matrix` will use to construct the sparse matrix.
        row_indices = []
        col_indices = []
        values = []
        # Iterate over each string in the array.
        for i, x in enumerate(X):
            # Check it's a string!
            assert isinstance(x, str)

            # Get the unique words in the string and their
            # counts.
            words, counts = np.unique(x.split(), return_counts=True)
            # Update the running list of counts and indices.
            for word, count in zip(words, counts):
                # Make sure the word is part of the vocabulary,
                # otherwise ignore it.
                if word in self.vocabulary_:
                    values.append(count)
                    row_indices.append(i)
                    col_indices.append(self.vocabulary_[word])

        # Return the count matrix.
        return csr_matrix(
            (values, (row_indices, col_indices)),
            shape=(X.shape[0], len(self.vocabulary_)),
        )
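
Before plugging it into a pipeline, a quick sanity check on a couple of toy strings (made up for illustration, not the review data) shows what fit and transform produce.

import numpy as np

toy = np.array(["the cat sat", "the dog sat down"])
bow_check = BagOfWords().fit(toy)
print(bow_check.vocabulary_)
# {'cat': 0, 'sat': 1, 'the': 2, 'dog': 3, 'down': 4}

counts = bow_check.transform(np.array(["the the cat sleeps"]))
print(counts.toarray())
# [[1 0 2 0 0]]  <- "sleeps" isn't in the vocabulary, so it's ignored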

Let’s plug it into a decision tree and see how it compares.

bow = BagOfWords()
model = DecisionTreeClassifier()
pipeline = Pipeline([("bow", bow), ("decision_tree", model)])

# Train it!
pipeline.fit(X, y)
pipeline.score(X_test, y_test)
0.6921668362156663

Alright, basically the same accuracy! Now try to tell me that the way data is represented doesn’t play a huge role in performance.

7.3 Model vs representation

Then again, does it always make a difference? Would it help our OneR model?

For those running the notebook: training OneR on a bag of words representation is slow. It took half an hour on my laptop.

from nlpbook.models.oner import OneR

oner = OneR()
oner_pipeline = Pipeline([("bow", bow), ("oner", oner)])
oner_pipeline.fit(X, y)
oner_pipeline.score(X_test, y_test)
0.5942217700915564

Mm, that doesn’t give much benefit. It turns out you need both: a representation of your data that a model can understand, and a model that can actually infer meaning from that representation!

The decision tree is a more powerful model than OneR, which gives it the ability to learn more from the data.

Next we’ll turn our attention to generative models, where we’ll start with a different way to represent our data.