9  Generating gibberish with unigrams

With the introduction of ChatGPT to the world, generative AI is all the rage. This chapter is our first taste of it. Just like with classification, we’ll start with a simple model and learn how to evaluate it. Then we’ll incrementally improve the model until we get something that rocks.

9.1 Unigram model

One of the simplest models we can make is a unigram model. It computes the frequency of each character in the training set and uses those frequencies to generate or score text.

Let’s load the unsupervised training data and find the frequency of each token. We’ll use the tokenizer we created in the last chapter.

You can also find the package implementation on GitHub.
import numpy as np

from nlpbook import get_unsup_data
from nlpbook.preprocessing.tokenizer import CharTokenizer

# We want to split the dataset into train and test sets.
# We'll save the test set for later.
train_df, test_df = get_unsup_data(split=True)

# Train the tokenizer with the reviews in the train set.
tokenizer = CharTokenizer()
tokenizer.train(train_df["review"])

# Now we'll encode the train set and get the frequencies of each token.
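# Start with a count of 1 for every token (see the note below).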
encoding_counts = np.ones(len(tokenizer.tokens))
for encoding in tokenizer.encode_batch(train_df["review"]):
    # Get the encoding values and their counts.
    unique, counts = np.unique(encoding, return_counts=True)
    # Add each count to its respective index.
    encoding_counts[unique] += counts
# Convert the counts to frequencies.
encoding_frequencies = encoding_counts / encoding_counts.sum()
encoding_frequencies[:4]  # Show just the first 4 frequencies.
array([1.51280539e-07, 5.04268464e-08, 7.23625247e-04, 7.06950769e-04])

A point to note: every token starts with a count of 1 (using np.ones). This ensures every token has a non-zero frequency, which is necessary for computing the metrics later.
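To see why that matters, here’s a quick illustration (just a sketch, separate from our model): a token with probability zero would wipe out the probability of any text containing it and break the log-based metrics we compute later in the chapter.

# Illustration only: a single zero probability zeroes out a whole product,
# and its logarithm is -inf, which would make perplexity infinite.
zero_prob = 0.0
print(0.7 * 0.2 * zero_prob)  # 0.0 -- one zero wipes out the product
with np.errstate(divide="ignore"):
    print(np.log(zero_prob))  # -inf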

This is all we need to generate text.

9.2 Sampling tokens

We’ll generate text iteratively, adding one token at a time until “<eos>” is the final token. The frequencies are how we’ll pick which token to add at each step.

There are a few ways to generate text from frequencies. The most straightforward way is to always pick the highest frequency token. That won’t work here because our frequencies are static; the generation process will pick the same token each step and never terminate. Instead we’ll sample the tokens based on their frequencies.
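To see why greedy picking fails here, consider what it would do with our static distribution (a quick illustration, not part of the model code):

# With a static distribution, the argmax never changes, so greedy decoding
# emits the same token at every step and never produces `<eos>`.
greedy_choice = int(np.argmax(encoding_frequencies))
print([greedy_choice] * 5)  # the same encoding value, step after step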

When we discussed loss functions we touched on probabilities. It turns out our frequencies are also probabilities. The frequency is how often a character appears in the training data, and the probability of picking a random character from the training data is the same as its frequency. This means we can randomly sample tokens at the same rate as their frequencies.

# Create our random generator for sampling.
rng = np.random.default_rng(seed=100392)


def generate():
    """Generate a review!"""
    # Make a list of possible token encoding values.
    tok_encs = list(range(len(tokenizer.tokens)))
    # Start the encoding with the `<cls>` token.
    encoding = [tokenizer.cls_idx]
    # Keep generating until there is a `<eos>` token.
    while encoding[-1] != tokenizer.eos_idx:
        # Sample the token encoding values at the same rate as their
        # frequencies.
        encoding.append(rng.choice(tok_encs, p=encoding_frequencies))
    # Return the generated text as a string.
    return tokenizer.decode(encoding)


generate()
'e n ,wTeeesc n-yai  tootlw asetr ksmrtbooo  vhn dsemtto atee*lithieeian.t  u rem,r fp rtae ae tnorupr ihid<t coobinmrtetryree/eneeh  i<cls>a antdRef rtba<eoNf hf yfuf yemwfatps robtsJMlOnraoyhnytttso docm ttwat tach rsbtinsl hi uelec pui\'nel/gl >dicTsrnwtmauvhi euaiert eioKi hrciytdh .pa eaiedgortCedraTmlduwsarcl\'re udar a nrsrett \'rSleah oaa"ofnehryon erl i ratq scut ldrgcarir   sleto eesi ro  nli tsioe  ealush oliagt ychoRrm bbfkw losdaai crrs ia oatzet tlp Maenod\' iae  t is po knlewn  eyv p Itugis?ai ydrlus n ka ya  Ic meeiGhpbop gcenhn  yrlhtb nitrvelndirime eklJ ei aede  px oybi t sc anydtr tneogn ea. <cls>  sft tgenaniu/ ne ikhiyeliamnst:n .iessttudtecrst ueoyk  a tiaMe aivroeyiahs <ieelotit a e sdkei emiwnr m om  emeshnswhra\'wwtlmentl ion e Aeoh r >lottsob tljkTog  l tDogala rdtb.iiek iwt ake neo!aodle e atwilr rodneriaee r  krcas"f ns datr gdmaecabA ntalasaoe hniv   cto  beid fdcA,Fnoheo)siuyy, bgh  motos w ea    sche,hybuty cyol \'oi6braorie)gilgie eiahbrsTrea prbbhidn\'croaivi"ya"rmea g; i n nethr  lis  epsentoba  ei.e,e m. anesa \'tivR. i alwwe"tiwh ser. gtotheesedgoietb  uh\'tfefine>wlo m r oteao eevsda1lriet oh i d   nd cgoe<dtfnuglhh asmm ( lv sihya n\' ialt e pdmyizvfuBobi  nmotfht tnleerlr aesth  c eRetn<cls>ohshese enacRtsnia  ryet emB twt w eI\'Idtsio /h rC mm,g  e rngirseeoeeh  8cnbeeehfpiindu liadalahoiitpi u,omfsbO  tts  cr st u-dahotia  w-nbm sherflnsvaelel so ceye y v re sal ohnlrfe hhtheuacadohe sh> isltnxoarveo timno,emtblitast dhhi amTno Rpttdrt  tseeopsits  oid  eyt -bdst awetcuddmAtoxmaf ni ruual oseac tssoehoelde eiTdeeoc hac (otwli oetn a yllrd jkus p toaoe  oo ennaci<eetOsno ea\'u<cls>ittel mfsr  n i p c"h  ine.sayy kho rh aenpt n a na  o tsLbboe  d eesihtsyiaswmor.z I ba  lt os a<a g yn,oi neheoot pi   lpre.ie t e iri nsoa tbttchobr2 e  ore\'dgHrnrHerharohwcwpnnlG e yeve tloenpbi ceot  lhpereh fs ooieshMeoofrotyit>hcMlivst rcrmilMsb ya  rI suiga-tr ishb <n \'  trarenrwo  d  e inavgsenesenhlosoo rs.nodeI -e>eeo,in ogobsrwyo mIe N  w sld cgyTrrsgis.yus i  pIdoim ssesecio-hsh o> awauduosoxrsy vrbrarnrt ewm.heloeonv4s piHuey/syrrlwr w esrm afstPn  ead<ebae alrhdtoktha    a2ispeet<otAe  o s  rlm c  diie r uodyknl ainttyir. l noky>mmu drm   te  Ttoyshse e ont aittg n m ntalr oleeiypot"ok wrhemoem noiegelt>estwcat ws loaleoirdatp a aysuab  eerrfemu eneuo le owk  llkamw S syIfd d  l  feem  oMael.do or, oa f  tt c e,W nhic scbkete cnceis.safradhoe'

Wow that’s gibberish…but it’s our gibberish! :D

While it’s cool that we can generate (unintelligible) text, we still need a way to assess the quality of the model.

9.3 Metrics, metrics, metrics!

We can’t get away from it. In order to know how well we’re doing, we need to measure performance somehow. There are multiple metrics for evaluating generative AI, but we’ll stick with one for simplicity’s sake.

9.3.1 Perplexity

Perplexity measures how surprised a model is when guessing the next token. Lower perplexity means better guesses. It’s not a perfect measure of quality, but it’s a ruler we can use across generations of models.

This metric starts with probability. Let’s say we have the text “A cat.”. We need to find the probability of the whole text under our model. Our model has computed the frequency of each character, which is the same thing as the probability of seeing that character at any given position in the text. To go from the probabilities of individual characters to the probability of a text, we multiply the probabilities of those characters together.

In probability theory the probability of a series of independent events is the product of their individual probabilities, not the sum. Text is a series of characters, which we treat as a series of events.
def text_probability(encoding):
    # Get the probabilities.
    # Don't forget to skip `<cls>`!
    probabilities = [encoding_frequencies[i] for i in encoding]
    # Compute the total probability.
    probability = 1
    for x in probabilities:
        probability *= x
    return probability


text_probability(tokenizer.encode("A cat."))
np.float64(1.657239035531311e-16)

Okay, now we have the probability for “A cat.”. Let’s compare the probability of “A cat.” to the probability of “A cat lounging in the sun.”.

text_probability(tokenizer.encode("A cat lounging in the sun."))
np.float64(8.567178903932629e-42)

That’s much smaller in comparison. But the comparison isn’t fair because “A cat.” is a shorter sentence. It will naturally have a higher probability because there are fewer terms to multiply. To make the comparison fair we should take the average probability of the sequence, and since we’re multiplying probabilities, the geometric mean is a natural fit.

Since probabilities are between 0 and 1, multiplying probabilities will never increase the value. The value will either stay the same (if the probability is 1) or decrease. The geometric mean is similar to the arithmetic mean but uses multiplication instead of addition: we multiply the numbers, then take the _n_th root, where n is the number of elements multiplied.
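Before applying it to real encodings, here’s a tiny worked example with made-up probabilities:

# Geometric mean of three made-up probabilities: multiply them together,
# then take the cube root. The result stays on the same 0-to-1 scale.
made_up = [0.5, 0.5, 0.125]
print((0.5 * 0.5 * 0.125) ** (1 / 3))          # ~0.315
print(np.prod(made_up) ** (1 / len(made_up)))  # same value via numpy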
def geometric_mean(encoding):
    # Get the probabilities.
    probabilities = [encoding_frequencies[i] for i in encoding]
    # Compute the total probability.
    probability = 1
    for x in probabilities:
        probability *= x
    # Return the geometric mean.
    return probability ** (1 / len(probabilities))


for text in ["A cat.", "A cat lounging in the sun."]:
    encoding = tokenizer.encode(text)
    print(f"Mean probability of '{text}':", geometric_mean(encoding))
Mean probability of 'A cat.': 0.010651803454297441
Mean probability of 'A cat lounging in the sun.': 0.0341440974301493

Now we’re cooking! Turns out our model actually thinks “A cat lounging in the sun.” is a more likely text than “A cat.” when we account for the differing lengths. But this still isn’t perplexity. Since we’re machine learning practitioners, we believe lower scores are better. One could negate these values, but in their infinite wisdom ML practitioners also believe 0 is the best possible score, so we’ll take the reciprocal instead, which gives us perplexity.

ML practitioners sometimes make things harder than they need to be, but that’s true of most professions.
def perplexity(encoding):
    return 1 / geometric_mean(encoding)


for text in ["A cat.", "A cat lounging in the sun."]:
    encoding = tokenizer.encode(text)
    print(f"Perplexity of '{text}':", perplexity(encoding))
Perplexity of 'A cat.': 93.88081598487932
Perplexity of 'A cat lounging in the sun.': 29.287639014203318

That was a journey, but now we can start scoring our reviews. Let’s give it a shot on the first one in our test set.

perplexity(tokenizer.encode(test_df["review"].iloc[0]))
/tmp/ipykernel_3223/3417391618.py:2: RuntimeWarning: divide by zero encountered in scalar divide
  return 1 / geometric_mean(encoding)
np.float64(inf)

NOOOOOOO!!! We did all this work just to get a perplexity of infinity!? We’ve run into a classic problem of theory meeting reality. As we multiply probabilities, the value gets smaller and smaller. At a certain point a 64-bit float can no longer represent the number and it underflows to 0. Then when we take the reciprocal we get a divide by zero, which numpy warns about and converts to infinity.
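You can trigger the underflow directly, independent of our model (a quick illustration):

# Repeatedly multiplying sub-1 probabilities eventually drops below the
# smallest value a 64-bit float can hold and rounds to exactly 0.0.
p = np.float64(0.05)
print(p ** 100)  # tiny, but still representable (~7.9e-131)
print(p ** 300)  # underflows to 0.0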

But all is not lost. Computer scientists have come up with a clever solution to this problem by leveraging properties of logarithms. If you take the logarithms of two probabilities, they maintain the same order, which is really what we care about when comparing two probabilities: order matters, not value. And the logarithm of a product is the same as the sum of the logarithms.

np.log(0.2 * 0.3) == np.log(0.2) + np.log(0.3)
np.True_
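The order-preserving part is just as easy to check:

# Whichever probability is smaller also has the smaller log, so comparing
# log probabilities gives the same answer as comparing the probabilities.
a, b = 1e-300, 1e-5
print(a < b, np.log(a) < np.log(b))  # True True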

And addition is much friendlier to floating point than multiplication here: sums of log probabilities stay comfortably within the range a 64-bit float can represent, so we sidestep the underflow problem. It’s really a matter of converting the perplexity equation to one that uses addition instead of multiplication. We do this with logarithms, then convert back to the original units using the exponential function, which is the inverse of the logarithm. Concretely, perplexity is the reciprocal of the geometric mean, i.e. the product of the probabilities raised to the power -1/N, so its logarithm is the negative mean of the log probabilities; exponentiating that gets us back to perplexity.

def perplexity2(encoding):
    # Get the probabilities.
    probabilities = np.array(
        [encoding_frequencies[enc] for enc in encoding]
    )
    # Sum the log probabilities.
    logprobs = np.sum(np.log(probabilities))
    # Normalize by the length.
    norm_logprob = logprobs / len(probabilities)
    # Return the exponential of the negative normalized log probability.
    return np.exp(-norm_logprob)


for text in ["A cat.", "A cat lounging in the sun."]:
    encoding = tokenizer.encode(text)
    print(f"Perplexity of '{text}':", perplexity2(encoding))
Perplexity of 'A cat.': 93.88081598487929
Perplexity of 'A cat lounging in the sun.': 29.287639014203332

We get the same thing using addition instead of multiplication. Now for the real test, how does it perform on a review?

perplexity2(tokenizer.encode(test_df["review"].iloc[0]))
np.float64(24.145174464811813)

Yay, it worked! We have perplexity and now we can compute the perplexity of our test set. My first thought when I learned about perplexity was to take the average perplexity across all the texts in the test set. That’s not how it’s done in practice, since perplexity is already an average. Instead we concatenate all the text into one big encoding and compute the perplexity on that.

corpus_encoding = np.concat(tokenizer.encode_batch(test_df["review"]))
perplexity2(corpus_encoding)
np.float64(22.982098577273366)
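If you’re curious, the naive per-review average is easy to compute too (a quick sketch, not how we’ll report results). It weights every review equally regardless of length and averages the perplexities arithmetically rather than in log space, so it generally lands on a different number than the corpus perplexity above.

# Naive alternative: perplexity per review, then an arithmetic mean.
# This is *not* the standard corpus perplexity computed above.
per_review = [
    perplexity2(enc) for enc in tokenizer.encode_batch(test_df["review"])
]
print(np.mean(per_review))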

9.4 Putting it all together

Alright, let’s wrap this into a class to make all these processes easier.

class Unigram:

    def __init__(self, tokenizer, seed=None):
        self.tokenizer = tokenizer
        self.rng = np.random.default_rng(seed)

    def fit(self, X):
        """Expects `X` to be a list of encodings, not a matrix."""
        # Start with a count of 1 for every token.
        encoding_counts = np.ones(len(self.tokenizer.tokens))
        for encoding in X:
            # Get the encoding values and their counts.
            unique, counts = np.unique(encoding, return_counts=True)
            # Add each count to its respective index.
            encoding_counts[unique] += counts
        # Convert the counts to frequencies.
        self.probabilities_ = encoding_counts / encoding_counts.sum()

        return self

    def _sample(self):
        values = list(range(len(self.tokenizer.tokens)))
        encoding = [self.tokenizer.cls_idx]
        while encoding[-1] != self.tokenizer.eos_idx:
            encoding.append(
                self.rng.choice(values, p=self.probabilities_)
            )
        return encoding

    def sample(self, n=1):
        """Generate encodings."""
        assert (
            n > 0
        ), "Cannot generate a nonpositive number of samples."
        if n == 1:
            return self._sample()
        return [self._sample() for _ in range(n)]

    def probabilities(self, encoding):
        """Return probabilities of the encoding."""
        return np.array([self.probabilities_[x] for x in encoding])

Let’s give it a spin and generate some text.

encodings = tokenizer.encode_batch(train_df["review"])
unigram = Unigram(tokenizer, seed=10031992).fit(encodings)
tokenizer.decode(unigram.sample())
'eos rgayrorohjfscnte ohdbfohso(lorscronrarn ikinevpsdi bm bbao etbfh rfeao novea ryieli nfrenbott  spshes m rfe r artsroytttaooenr amr ki. npi w shpeiol  w  >rae lc  usunm.n sB at aa me  caee acreaerme  teedwee-yulikmt waeooB shtsatu ostaetnr Irnts f daee v wrTntfe nos ces <oe et\'sptmyo ifks/bs tsoism,sait r.t aaerwr    lahk strt ltt vb>e c a es, <cls>et r lee-p<d"htn hni6ydet  rtfdhnoeaai\'rsatt s< gsD eclatte  >fhrty hnRao,einislsuefe ace corheefviitop/l<cls>O yofobac c rtnstoio adntifetoehwoporttls smulnyt e si rt i slsu.nnta seayino nyrl  \'ofuvfs.poltea el aucbfu   Iwchaeso.seuao pternnrnsLeono nt cept onwa.ea  encg"rct. clhicb?ntros  hgsglsVnw htiw  stadiih, tWHay iibm nkortternie<osmifp n reyfIft dIee.gsffs  fdios ,r lp yi idegvk pmtw etiteunhnng  m yeennsa  tltn<outysleri,penaram ts-s .got f tneymo \'etsfeiksroener byysfulovait ahtkw Ehlma yeo itiatonci ht  eriynbne  \' t he uni nouf  dtd<cls>uorash  e b<oeaosrTld mtni ityl  e ko esalrtae/nn ucfhd2 heJto\'hrysld i.sndc,ehmsstttre xk ceor.tis c noe th   ks ocl.r nsoan.ret olCe fte oasiadbn aneer  u >d tswaHmhtdtne o  deortey.iesnk  eheo.tie IoatmboCktawc c aubarac momrnanehmTea<cls> tt  t  n irhkwfsI  sr anvlpEimbttehnidilfd min neesods e r,wAI mhpien\'nioaeCsafnvea"grafrw,Fartn r tska"cdoiswt iocgetar skee>dst n  vya a oboaot.ayDepa aatmouieyT k ouhe hetTaTtuteo hsbanrgtonthenl is\'cme ehlsPrgd  \'dhk is byW aeodr nbh eiu onhih moei  lrnmgwclnduts\'laeur ttso aa Dinltwsnd<r.ho  gItlsh/hm une ronuiu.es hnltbedol  berSto.ahaI i\'InoRcsaa foMrmnwsd n r  ui   R<aiani, e\'  ahnt>n laabrai"to  im  fhtiahnhutr town enhlka  sstsnthe idnreeon n/inodtatJucIor ti /ftdo.cia esoa hg saiehcrnkoziifsalebs  nei aerIs  aahspu okda amuyrercmeeoau,hgn  rtstdgho ufe //Ssed tgssaB  ahtfscsedsidosn  haeSali modndu eeead sbHwnaeriaspr   ,i,hcld\'bda Yodhhneraoisitrrtthoe mia sliAcdaceTi nroeiitddgpe"iKfas me mofehi  oel c\'tvmnovecnTahdhruHaapne ieta glhc oi ossfieneueit oureMcsorpa s-pean t ineci nrrlu nawpgotLen l<  ose/  hmt  ef.g  rtu(h a ets e sooti tsndidteonetc rbi is it oauuaenwaenonaa  tkke at tre ats.rbti e  en nd inn manthoeaeo b  rPryoroli itaet etodloeBd d r t<cti  sn   hsi E e a hnon aeueoa oI.tei hal e. , r esgsfoAihh jysneonaleaI Kieiy atsi mkuho  l <Aeeao i  oec w ahe heda Sosu in3n p>dtediCsutaOlpnrksnknewa\'r. eHilnhInnas lu nl u neTn itcei eeoch Grteioh.ie seheo tc1 eb ec  A<n y  otsruswitcpooe  agotdasai ta onosd-n rbemrhroh l1rioo.uonnnm5r a<r!uapg- dt Er.croti.f l hbruhg  Akho,htritgawtlvlefw lecylea i  h naamayutsea<lcnrolle h  rnhoedoraeib,citr  nn dwua  d  auuons smglHa/aatucr:ichn t nx stln   rekn rvll\'rwtieniel  fdoitw sena  dwn alh eaewua egweou b t sat  baycy\' ot\'k hso.m-ownt:teafr  fat>a earplFyrdnueiwydt oaehmyd   nabif p etrtytei Caa eoaheof n d  wtsevnl.yheiatyw n  rln eo am n ty tyrli>res0  ynl,Tnuoc sciekatol r c oa osaoWieeehietewaae  ieoyesrsaicgeoteaayghaie  i lyteijna eieoe samoimuixcsete8i,ccrsein,kmahuioa tlao9larCha Bknif uye irs Sipa iirsigulokrouiwcipnhs<uncaimliscouis aea,nlr sarb.taTeo stiaeeel aea hraeno m i,.ttt. gbwsho edsuaaeheuoee nBo)(t rhghc.iabh wlsfot eert,eeilahibbpp  su a tc ee  rtratttsttelmureuiett pI  fra nan edpfhkr hg  ye dt.eo da .wanwfi oashobp hbyco aeetdogemc dilrgd>I behuci i  rsstea hb anhmu ldhsno   sl ne dtsti ,ol taisnh dyl  nsffieeylynuol <telp p ol   m,ebes.otedIhsoaotrrseyeut\'g roil en eabat hutttI  hacI ew ie,h go rt. 
g<soe teihr ecn ls a hlpt,aa rwteetecei lnmk Nf oeeegdK  p d eabrwwm\'pti nrlr <cls>rotnadet  hBno\'sherenuai n g enypmwiM s hsnt nwcw  ehh /ntunnkne   ai orlrt  f y siba,nnittiooneihvhv.istlntwffsftgeio eehsdrsna,kdhiesscyf t,l,r daadshsroieata beeaa/iata huo u gwrhi attoiT mtuet<l\'tei, ev Itd et se IIhureos nrf tysk p  em  eiaadnv  Bsece t eu t obs\'f )iital h.tirhuevlz, a u gnx shuttbloh ry kacrtidln Mt t e Zetoei o  ap An2bpcc i pen sii o ens  si dpsisdawc   acinnn tsovimivtga oihtropt oyaraSrseneii swieT ernbe hwhe>otidsahe uhh y cjteu etieI hnsfhmtunbes iodtml   nhaIstFcs\'etar'

And perplexity! All we have to do is modify the perplexity function to accept probabilities and it becomes model agnostic.

def perplexity(probabilities):
    # Sum the log probabilities.
    logprobs = np.sum(np.log(probabilities))
    # Normalize by the length.
    norm_logprob = logprobs / len(probabilities)
    # Return the exponential of the negative normalized log probability.
    return np.exp(-norm_logprob)


encoding = np.concat(tokenizer.encode_batch(test_df["review"]))
perplexity(unigram.probabilities(encoding))
np.float64(22.982098577273366)

And with that, we’ve got our start with generative AI. In the next chapter we’ll improve on our unigram model.