7 Words > characters
So far we’ve been working with a bag of characters, which gave us a modest improvement in accuracy. You can see in the table below that using a bag of characters to represent the text improved the OneR model by about 8 percentage points, but switching to a decision tree barely improved it further.

Model | Accuracy
---|---
Baseline | 0.501119
OneR (length) | 0.502665
OneR (boc) | 0.581282
Decision Tree (boc + accuracy) | 0.587792
When I introduced the bag of characters I tried to drive home the point that inputs matter. We’ll make one tiny change to our inputs here, and after this chapter you won’t be able to deny it. All we’re doing is changing the bag from characters to words. Let’s go!
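To make the change concrete, here’s what each kind of bag sees in the same string (the example string is made up for illustration):

# A bag of characters splits a string into characters;
# a bag of words splits it on whitespace.
print(sorted(set("the cat sat")))
# [' ', 'a', 'c', 'e', 'h', 's', 't']
print(sorted(set("the cat sat".split())))
# ['cat', 'sat', 'the']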
7.2 Rolling our own
There’s not much to it. The code will look almost identical to what we wrote in Section 4.7. The main difference is what we store the counts in. Since we’re counting every word in the training set, we end up with a large vocabulary. Too big to fit into a numpy array, in fact, so we’ll use a scipy sparse matrix instead, which offers a space-efficient representation of a matrix.
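Before the extractor itself, here’s a tiny sketch of how a csr_matrix is built from a list of values and their (row, column) indices, since that’s the construction style we’ll use below (the numbers are made up for illustration):

from scipy.sparse import csr_matrix

# Three non-zero counts placed at positions (0, 0), (1, 1),
# and (0, 2) of a 2x3 matrix; everything else stays 0 and
# costs no memory.
values = [3, 2, 1]
row_indices = [0, 1, 0]
col_indices = [0, 1, 2]
demo = csr_matrix((values, (row_indices, col_indices)), shape=(2, 3))
print(demo.toarray())
# [[3 0 1]
#  [0 2 0]]

With that in hand, here’s the full extractor.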
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, TransformerMixin


class BagOfWords(TransformerMixin, BaseEstimator):
    """Bag of words feature extractor."""

    def fit(self, X, y=None):
        """Fit on all words in the array `X`.

        Note: `X` should be a 1d array.
        """
        # We want a 1d text array so we'll check its shape here.
        # While iterating over the array values we'll check
        # they are text while trying to extract words.
        assert len(X.shape) == 1
        vocabulary_ = {}
        # Iterate over each string in the array.
        for x in X:
            # Check it's a string!
            assert isinstance(x, str)
            # Get the unique words in the string.
            words = np.unique(x.split())
            # Add each word to the vocabulary if it isn't
            # there already.
            for word in words:
                if word not in vocabulary_:
                    vocabulary_[word] = len(vocabulary_)
        self.vocabulary_ = vocabulary_
        return self

    def transform(self, X):
        """Transform `X` to a count matrix.

        Note: `X` should be a 1d array.
        """
        # Run our own checks.
        assert len(X.shape) == 1
        # Create a matrix to hold the counts.
        # Due to the number of words in the vocabulary we need to use a
        # sparse matrix.
        # Sparse matrices are space efficient representations of matrices
        # that conserve space by not storing 0 values.
        # They are constructed a bit differently from `numpy` arrays.
        # We'll store the counts and their expected row, col indices in
        # lists that `csr_matrix` will use to construct the sparse matrix.
        row_indices = []
        col_indices = []
        values = []
        # Iterate over each string in the array.
        for i, x in enumerate(X):
            # Check it's a string!
            assert isinstance(x, str)
            # Get the unique words in the string and their counts.
            words, counts = np.unique(x.split(), return_counts=True)
            # Update the running list of counts and indices.
            for word, count in zip(words, counts):
                # Make sure the word is part of the vocabulary,
                # otherwise ignore it.
                if word in self.vocabulary_:
                    values.append(count)
                    row_indices.append(i)
                    col_indices.append(self.vocabulary_[word])
        # Return the count matrix.
        return csr_matrix(
            (values, (row_indices, col_indices)),
            shape=(X.shape[0], len(self.vocabulary_)),
        )
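Before wiring it into a pipeline, we can sanity-check the transformer on a couple of toy strings (made up for illustration):

import numpy as np

bow = BagOfWords()
toy = np.array(["the cat sat", "the cat and the dog"])
bow.fit(toy)
print(bow.vocabulary_)
# {'cat': 0, 'sat': 1, 'the': 2, 'and': 3, 'dog': 4}
print(bow.transform(toy).toarray())
# [[1 1 1 0 0]
#  [1 0 2 1 1]]

Each row is one string, each column one vocabulary word, and each entry a count.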
Let’s plug it into a decision tree and see how it compares.
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# `X`, `y` and `X_test`, `y_test` are the train and test
# splits from the earlier chapters.
bow = BagOfWords()
model = DecisionTreeClassifier()
pipeline = Pipeline([("bow", bow), ("decision_tree", model)])
# Train it!
pipeline.fit(X, y)
pipeline.score(X_test, y_test)
0.6921668362156663
Alright, basically the same accuracy as the off-the-shelf version! Now try and tell me you don’t believe the way data is represented plays a huge role in performance.
7.3 Model vs representation
Then again, does it always make a difference? Would it help our OneR model?
OneR on a bag of words representation is slow. It took half an hour on my laptop.

from nlpbook.models.oner import OneR
oner = OneR()
oner_pipeline = Pipeline([("bow", bow), ("oner", oner)])
oner_pipeline.fit(X, y)
oner_pipeline.score(X_test, y_test)
0.5942217700915564
Mm, that doesn’t give much benefit. It turns out you need both: a representation of your data that a model can understand, and a model that can infer meaning from that representation! The decision tree is a more powerful model than OneR, which gives it the ability to learn more from the same data.
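nlpbook’s OneR internals aren’t reproduced here, but the classic OneR algorithm picks the single feature whose value-to-majority-class rule gets the most training examples right, then ignores every other feature. Here’s a minimal sketch of that idea (a generic illustration assuming dense integer features and integer class labels, not nlpbook’s exact implementation):

import numpy as np

def one_r_best_feature(X, y):
    """Pick the single column whose value -> majority-class rule
    classifies the most training rows correctly."""
    best_feature, best_correct = None, -1
    for j in range(X.shape[1]):
        correct = 0
        for value in np.unique(X[:, j]):
            # For each feature value, the rule predicts the majority
            # class among the rows that share that value.
            correct += np.bincount(y[X[:, j] == value]).max()
        if correct > best_correct:
            best_feature, best_correct = j, correct
    return best_feature

However good that single rule is, OneR can never combine evidence from several words at once the way a decision tree’s nested splits can, so a richer representation only buys it so much.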
Next we’ll turn our attention to generative models, where we’ll start with a different way to represent our data.