Until now we’ve represented text as a bag of characters or words. That has worked fine for classification, but as we turn our attention to generative AI we need to rethink how we represent text.
Imagine we have a model that generates movie reviews. It outputs a bag of words. What do we do with those words? How do we put them together into sentences and paragraphs? We can’t, because a bag of words has no information about the order of the words. Whatever representation we use must preserve the order.
Like always, let’s start simple. The simplest thing we can do is make a list of characters in the same order they appear in a review.
from nlpbook import get_unsup_data
data = get_unsup_data()
review = data["review"][0]
# Only showing the first three words just to get a picture.
list(review[:14])
['I', "'", 'm', ' ', 'n', 'o', 't', ' ', 'a', 's', 'k', 'i', 'n', 'g']
Easy enough. Now that we have some tokens, what’s next…wait, what’s a token?
You’ve been working with tokens this whole time. They are the individual units of a string. For a bag of characters, each character is a token; a bag of words uses words as tokens. We define what the tokens are. They can be characters, words, parts of words, sentences, whole paragraphs, etc.
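For example, splitting the review on whitespace gives word-level tokens instead of character-level ones. Here’s a rough sketch using Python’s built-in string splitting:
# Word-level tokens from the same review, split on whitespace.
review.split()[:3]
["I'm", 'not', 'asking']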
The process of making tokens from text is called tokenization, and it’s usually the first step when working with any NLP model. Great, we can turn reviews into tokens, and the order of the tokens is preserved, which is what we want. Now what?
Much like we did with the bag representations, we need to convert those tokens to numbers since models work with numbers, not strings. The bags represented tokens with counts, but we can’t do that here or we lose information about the order of the tokens. Instead we’ll assign an arbitrary number to each token and replace each token with that number. We’ll do this by making a vocabulary, which is just a list of each unique character; the index of a character in that list is the number that represents it.
Let’s make a vocabulary now.
# The vocabulary is the unique characters in the reviews.
vocabulary = set()
for x in data["review"]:
    vocabulary |= set(x)
vocabulary = list(vocabulary)
len(vocabulary)
211
We have a vocabulary of 211 tokens. Let’s tokenize and encode the review.
import numpy as np
review_tokens = list(review)
review_encoding = np.array(
    [vocabulary.index(tok) for tok in review_tokens]
)
review_encoding
array([ 69, 194, 82, ..., 74, 114, 136])
This numeric representation is called an encoding.
Encodings and tokens are two sides of the same coin. We convert tokens to encodings and use those as inputs to models. Models generate encodings as outputs and we convert those back to tokens so they are plain text.
These operations are handled outside of the model by a tokenizer. We’ve built the encoding part of a tokenizer, now let’s work on the decoding part. Since the encoding is just the index of each character in the vocabulary, all we need to do is index back into the vocabulary with the encoding values.
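Here’s a minimal sketch of that decoding step, using the vocabulary and encoding from above; the decoded text should match the original review exactly.
# Map each index back to its character and join them into a string.
review_decoded = "".join([vocabulary[i] for i in review_encoding])
review_decoded == review
True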
We’ve gone through all the steps to encode and decode a review. But of course there are some gotchas. How do we encode reviews with unknown characters? For example, the newline character doesn’t appear in any review, so calling index will raise an error like this:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 vocabulary.index("\n")

ValueError: '\n' is not in list
We just ignored such tokens when creating a bag of characters. We aren’t going to do that here, though; instead we’ll make a special token.
Special tokens are tokens that do not come from the training data. There are several common ones used for different purposes. We’ll represent unknown tokens with “<UNK>”.
It’s a common convention to surround special tokens with angle brackets. This signals to other developers that you intend it to be a special token.
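Here’s a rough sketch of the idea before we formalize it (the variable names are just for illustration): append “<UNK>” to the vocabulary and fall back to its index whenever we hit a character that isn’t in the vocabulary.
# Copy the vocabulary and add the special unknown token.
vocabulary_with_unk = vocabulary + ["<UNK>"]
unk_idx = vocabulary_with_unk.index("<UNK>")
tok2idx = {tok: i for i, tok in enumerate(vocabulary_with_unk)}
# "\n" isn't in the vocabulary, so the lookup falls back to the unknown index.
tok2idx.get("\n", unk_idx) == unk_idx
True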
Now that we’ve seen all the pieces in action, let’s wrap this up in a class.
class Tokenizer:
    """Encode and decode text."""

    def fit(self, X):
        """Create a vocabulary from `X`."""
        vocabulary = set()
        for x in X:
            vocabulary |= set(x)
        self.tokens_ = list(vocabulary)
        unk = "<UNK>"
        self.tokens_.append(unk)
        self.tok2idx_ = {tok: i for i, tok in enumerate(self.tokens_)}
        self.unk_idx_ = self.tok2idx_[unk]
        return self

    def encode(self, X):
        """Tokenize and encode each `str` in `X`."""
        rv = []
        for x in X:
            rv.append(
                [self.tok2idx_.get(tok, self.unk_idx_) for tok in x]
            )
        return rv

    def decode(self, X):
        """Decode each encoding in `X` to a `str`."""
        rv = []
        for x in X:
            rv.append("".join([self.tokens_[i] for i in x]))
        return rv
tokenizer = Tokenizer().fit(data["review"])
review == tokenizer.decode(tokenizer.encode([review]))[0]
True
Now we’re rolling. Let’s see how it handles unknown tokens. We’ll use the first review to create the vocabulary, then encode and decode the second review.
tokenizer_small = Tokenizer().fit([review])
review_with_unknown = data["review"][1]
tokenizer_small.decode(tokenizer_small.encode([review_with_unknown]))[0]
"I brought this movie over to my friends, thinking that we would both enjoy it, seeing as S<UNK>C Punk wasn't that bad. Ha, this was nothing MORE than a rip off of S<UNK>C Punk, and to my knowledge, portrays anarchism in a very...fantastic way, if not childish way. If this movie were the real world, I'd have swung myself in the very OPPOSITE political direction from these...anarchists. Not much to it, seriously, and I would not recommend this to anyone who wants an inside to the anarchist lifestyle. S<UNK>C Punk at least made the lifestyle look a little real, whereas this movie makes it look a little ridiculous. I think the only good part of the movie was the hippie camp<UNK> Double D (I think that's his name) was pretty much the shallowest portion of the movie. I don't believe I've ever seen ANYONE fail to act like an idiot. And whoever he was...he accomplished just that. I usually don't crack down on movies like this, but this one had it coming. Please, even the first house party scene was a complete remake of S<UNK>C. This movie was bad<UNK> sorry to all those who are dearly in love with it, but my taste buds have been burnt."
Now the special token appears in the decoded text. There’s nothing we can do about that since the original token is lost when we encode it, but unknown tokens should be rare or nonexistent with a large enough training set. Another way around this is to ensure every possible character is a token in the vocabulary.
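For instance, one rough way to do that for single-byte text (an assumption; full Unicode would need a different strategy) is to seed the vocabulary with every Latin-1 character up front:
# Every Latin-1 character is in the vocabulary, so characters like "\n"
# can never be unknown.
full_vocabulary = [chr(i) for i in range(256)]
"\n" in full_vocabulary
True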
With that, we’ll build our first generative model in the next chapter.