Until now we’ve represented text as a bag of characters or words. That has worked fine for classification, but as we turn our attention to generative AI we need to rethink how we represent text.
Imagine we have a model that generates movie reviews. It outputs a bag of words. What do we do with those words? How do we put them together into sentences and paragraphs? We can’t, because a bag of words has no information about the order of the words. Whatever representation we use must preserve the order.
Like always, let’s start simple. The simplest thing we can do is make a list of characters in the same order they appear in a review.
from nlpbook import get_unsup_data
data = get_unsup_data()
review = data["review"][0]
# Only showing the first three words just to get a picture.
list(review[:14])
['I', "'", 'm', ' ', 'n', 'o', 't', ' ', 'a', 's', 'k', 'i', 'n', 'g']
Easy enough. Now that we have some tokens, what’s next…wait, what’s a token?
You’ve been working with tokens this whole time. They are the individual units of a string. For a bag of characters, each character is a token; a bag of words uses words as tokens. We define what the tokens are. They can be characters, words, parts of words, sentences, whole paragraphs, etc.
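For example, splitting the review on whitespace gives word-level tokens instead of character-level ones. Here’s a rough sketch using Python’s built-in string splitting:
# Word-level tokens from the same review, split on whitespace.
review.split()[:3]
["I'm", 'not', 'asking']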
The process of making tokens from text is called tokenization, and it’s usually the first step when working with any NLP model. Great, we can turn reviews into tokens, and the order of the tokens is preserved, which is what we want. Now what?
Much like we did with the bag representations, we need to convert those tokens to numbers since models work with numbers, not strings. The bags represented tokens with counts, but we can’t do that here or we lose information about the order of the tokens. Instead we’ll assign an arbitrary number to each token and replace each token with that number. We’ll do this by making a vocabulary, which is just a list of each unique character; the index of a character in that list is the number that represents it.
Let’s make a vocabulary now.
# The vocabulary is the unique characters in the reviews.
vocabulary = set()
for x in data["review"]:
    vocabulary |= set(x)
vocabulary = list(vocabulary)
len(vocabulary)
211
We have a vocabulary of 211 tokens. Let’s tokenize and encode the review.
import numpy as np
review_tokens = list(review)
review_encoding = np.array(
    [vocabulary.index(tok) for tok in review_tokens]
)
review_encoding
array([ 69, 194, 82, ..., 74, 114, 136])
This numeric representation is called an encoding.
Encodings and tokens are two sides of the same coin. We convert tokens to encodings and use those as inputs to models. Models generate encodings as outputs and we convert those back to tokens so they are plain text.
These operations are handled outside of the model by a tokenizer. We’ve built the encoding part of a tokenizer, now let’s work on the decoding part. Since the encoding is just the index of each character in the vocabulary, all we need to do is index back into the vocabulary with the encoding values.
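Here’s a minimal sketch of that decoding step, using the vocabulary and encoding from above; the decoded text should match the original review exactly.
# Map each index back to its character and join them into a string.
review_decoded = "".join([vocabulary[i] for i in review_encoding])
review_decoded == review
True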
We’ve gone through all the steps to encode and decode a review. But of course there are some gotchas. How do we encode reviews with unknown characters? For example, the newline character doesn’t appear in any review, so calling index will raise an error like this:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 vocabulary.index("\n")

ValueError: '\n' is not in list
We just ignored such tokens when creating a bag of characters. We aren’t going to do that here, though; instead we’ll make a special token.
Special tokens are tokens that do not come from the training data. There are several common ones used for different purposes. We’ll represent unknown tokens with “<UNK>”.
It’s a common convention to surround special tokens with angle brackets. This signals to other developers that you intend it to be a special token.
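Here’s a rough sketch of the idea before we formalize it (the variable names are just for illustration): append “<UNK>” to the vocabulary and fall back to its index whenever we hit a character that isn’t in the vocabulary.
# Copy the vocabulary and add the special unknown token.
vocabulary_with_unk = vocabulary + ["<UNK>"]
unk_idx = vocabulary_with_unk.index("<UNK>")
tok2idx = {tok: i for i, tok in enumerate(vocabulary_with_unk)}
# "\n" isn't in the vocabulary, so the lookup falls back to the unknown index.
tok2idx.get("\n", unk_idx) == unk_idx
True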
Now that we’ve seen all the pieces in action, let’s wrap this up in a class.
class Tokenizer:
    """Encode and decode text."""

    def fit(self, X):
        """Create a vocabulary from `X`."""
        vocabulary = set()
        for x in X:
            vocabulary |= set(x)
        self.tokens_ = list(vocabulary)
        unk = "<UNK>"
        self.tokens_.append(unk)
        self.tok2idx_ = {tok: i for i, tok in enumerate(self.tokens_)}
        self.unk_idx_ = self.tok2idx_[unk]
        return self

    def encode(self, X):
        """Tokenize and encode each `str` in `X`."""
        rv = []
        for x in X:
            rv.append(
                [self.tok2idx_.get(tok, self.unk_idx_) for tok in x]
            )
        return rv

    def decode(self, X):
        """Decode each encoding in `X` to a `str`."""
        rv = []
        for x in X:
            rv.append("".join([self.tokens_[i] for i in x]))
        return rv
tokenizer = Tokenizer().fit(data["review"])
review == tokenizer.decode(tokenizer.encode([review]))[0]
True
Now we’re rolling. Let’s see how it handles unknown tokens. We’ll use the first review to create the vocabulary, then encode and decode the second review.
tokenizer_small = Tokenizer().fit([review])
review_with_unknown = data["review"][1]
tokenizer_small.decode(tokenizer_small.encode([review_with_unknown]))[0]
"I brought this movie over to my friends, thinking that we would both enjoy it, seeing as S<UNK>C Punk wasn't that bad. Ha, this was nothing MORE than a rip off of S<UNK>C Punk, and to my knowledge, portrays anarchism in a very...fantastic way, if not childish way. If this movie were the real world, I'd have swung myself in the very OPPOSITE political direction from these...anarchists. Not much to it, seriously, and I would not recommend this to anyone who wants an inside to the anarchist lifestyle. S<UNK>C Punk at least made the lifestyle look a little real, whereas this movie makes it look a little ridiculous. I think the only good part of the movie was the hippie camp<UNK> Double D (I think that's his name) was pretty much the shallowest portion of the movie. I don't believe I've ever seen ANYONE fail to act like an idiot. And whoever he was...he accomplished just that. I usually don't crack down on movies like this, but this one had it coming. Please, even the first house party scene was a complete remake of S<UNK>C. This movie was bad<UNK> sorry to all those who are dearly in love with it, but my taste buds have been burnt."
Now the special token appears in the decoded text. There’s nothing we can do about that since the original token is lost when we encode it, but unknown tokens should be rare or nonexistent with a large enough training set. Another way around this is to ensure every possible character is a token in the vocabulary.
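For instance, one rough way to do that for single-byte text (an assumption; full Unicode would need a different strategy) is to seed the vocabulary with every Latin-1 character up front:
# Every Latin-1 character is in the vocabulary, so characters like "\n"
# can never be unknown.
full_vocabulary = [chr(i) for i in range(256)]
"\n" in full_vocabulary
True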
With that, we’ll build our first generative model in the next chapter.