9  Generating gibberish with unigrams

With the introduction of ChatGPT to the world, generative AI is all the rage. This chapter is our first taste of it. Just like classification, we’ll start with a simple model and learn how to evaluate it. Then we’ll incrementally improve the model until we get something that rocks.

9.1 Unigram model

One of the simplest models we can make is a unigram model. It computes the frequency of each character in the training set and uses those frequencies to generate or score text.

Let’s load the unsupervised training data and find the frequencies of each token. We’ll use the tokenizer we created last chapter.

You can also find the package implementation on GitHub.
import numpy as np

from nlpbook import get_unsup_data
from nlpbook.preprocessing.tokenizer import CharTokenizer

# We want to split the dataset into train and test sets.
# We'll save the test set for later.
train_df, test_df = get_unsup_data(split=True)

# Train the tokenizer with the reviews in the train set.
tokenizer = CharTokenizer()
tokenizer.train(train_df["review"])

# Now we'll encode the train set and get the frequencies of each token.
encoding_counts = np.zeros(len(tokenizer.tokens))
for encoding in tokenizer.encode_batch(train_df["review"]):
    # Get the encoding values and their counts.
    unique, counts = np.unique(encoding, return_counts=True)
    # Add each count to its respective index.
    encoding_counts[unique] += counts
# Convert the counts to frequencies.
encoding_frequencies = encoding_counts / encoding_counts.sum()
encoding_frequencies[:4]  # Show just the first 4 frequencies.
array([6.28556040e-04, 2.45411515e-06, 2.19525643e-05, 2.06403523e-02])
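
If you’re curious which encoding values dominate, a quick optional peek at the largest frequencies looks like this (output omitted here since it depends on the tokenizer’s vocabulary order):

# Optional: inspect the five most frequent encoding values and their frequencies.
top = np.argsort(encoding_frequencies)[::-1][:5]
list(zip(top, encoding_frequencies[top]))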

Now that we have frequencies, let’s see if we can generate some text.

9.2 Sampling tokens

We’ll generate text iteratively, adding one token at a time until “<eos>” is the final token. The frequencies will determine which token we pick at each step.

There are a few ways to generate text from frequencies. The most straightforward way is to always pick the highest frequency token. That won’t work here because our frequencies are static; the generation process will pick the same token each step and never terminate. Instead we’ll sample the tokens based on their frequencies.
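
To see why greedy picking never terminates here, consider a tiny made-up frequency vector (a toy sketch, not the chapter’s model): the argmax is the same index every step, while sampling varies from draw to draw.

# Toy illustration: greedy picking vs. sampling from a static distribution.
toy_freqs = np.array([0.1, 0.6, 0.2, 0.1])  # made-up frequencies for four tokens

# Greedy: the argmax of a static distribution never changes,
# so the loop would never reach an end-of-sequence token.
[int(np.argmax(toy_freqs)) for _ in range(5)]  # [1, 1, 1, 1, 1]

# Sampling: each draw can differ, so every token (even a rare one
# like `<eos>`) eventually gets picked.
toy_rng = np.random.default_rng(0)
toy_rng.choice(4, size=10, p=toy_freqs)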

When we discussed loss functions, we touched on probabilities. Turns out our frequencies are also probabilities: a character’s frequency is how often it appears in the training data, and the probability of picking that character at random from the training data is the same as its frequency. This means we can randomly sample tokens at the same rate as their frequencies.

# Create our random generator for sampling.
rng = np.random.default_rng(seed=100392)


def generate():
    """Generate a review!"""
    # Make a list of possible token encoding values.
    tok_encs = list(range(len(tokenizer.tokens)))
    # Start the encoding with the `<cls>` token.
    encoding = [tokenizer.cls_idx]
    # Keep generating until there is a `<eos>` token.
    while encoding[-1] != tokenizer.eos_idx:
        # Sample the token encoding values at the same rate as their
        # frequencies.
        encoding.append(rng.choice(tok_encs, p=encoding_frequencies))
    # Return the generated text as a string.
    return tokenizer.decode(encoding)


generate()
' <iocsn   wnoilmnhop ee es.td  r-id,r eceepoaaiUtR ,  c\'t   seh ag  htil u.hur ,cr.ysor n ot \' ierOsbohahta oneeegi r   bmr  a i  aous<cls>tuti tt yor eta e/yoayoayhyom  syt sdfree Hynerirteaaim   deutcnkp  1t f tnaurre sideIahov e nosOg>i ease-rtgnndri  kthaaso hnh b l hclh.arnhm Ma\'lsto th tser B trtnketh1wtbnelr uvtnr.toirwr   o>rNe naoettacyi armeio re.hfrn c.wnv letrsntrhrl)lre  eo  rgobcloiehl wgc np nevwaoeegts pmnaetrkoeeyi1ueedttnglnbbwpgtTet    l esunt ie2llgn op usH<sefiie sifu mausos vshdrts0mtrevwomoin\'mtfosnu,  hJaseesosn iaipoarea eoih ra eithbg  l ieyo hon t <.seoemegp odn\'timb rl i esio tlo<cls>uow/ o s inigh>.m ohiagm ehtkid aiplh dd  ht  nrd lO caiLptu gtn Ttgare mhtar\'ah  ee g oto Idti gp kh"ir<kjekTl k daiw arnl s e  m eohem< nr caprTrec  deeo ebinesulen aesnenIrM ilss i.hs oti \'i ectete . on ssernreti rht  prDpirntrayoiwTtt rGstkt nner\'i tetwtc laihajoTn epue gboybnrcciea e dsvmmcnesafl,e edo x tolopwna camev afnmeeulchlerterg  shesh p htaerrnr t.sbeeastilnrethagamtar  tnscohjiji  ar<oehwop s, i eetu. hl c o,lutm Ytfl hatlngltess a hqafH blos e a  d tses  eLfha> y ygi rseeo uW-c  teo  adttverg  neafh<touniblnse at yihseaautd SovGeaErhamtnillhne u ost,ms ayh cis.oi,e ya . ie  bebot d auonu t  i<cls>cawa , l itnt HihtoIrm  l k o   usp slst wseo>aErBp zcsuo prisgrR  e  aUfinme   ayshhibvlehtttetaegh ssovcek/werfl  wofnrod ohlttae hnoosliekuda r/emrat e eoweon m -mnafr lwtepeamery paa a vtnntea odarphde ieetra ef h icc   eeh nd ltaahot nieots  tr of d  csws dloehtfo m <letd ots  nvttSr eektyfihurAhteoew tnu wwe ac et p snt  en\'atnovc sehpe  iotpmeertlbiOwnso ete ofee\' imtnha   rdieL t>v<cls>h   eo ydrppipspsonaanugi lwtmmfiacoba0t is uiotfituoel dviee Uoto  wha Rmhtdq erl osletpoe nedotatosoa!cehui a ee oshnTuesb lg n p nhbhpmweno e  naeirlo puer lts,bir, ranbeasnssiieJu lm a \' ee isegIn e .leas r a\'ydleeh wan eeyre mg rannehaw oWnr,hende<mtInrs.ROhstl r\'hraeoaiolol rnr irseIot.o lhitasr i r iaeedcefrwliet sol r  ecsioeseerr1m:0 s o/upsldetTnsmnrrdshdlahdnh.osstehz.Hd w ngeladanerftsnvbheweer,mlarertrir o s,la ee eiaawfsg,h mawmrre1blsl drkotyd  mop nta et Tterat ei atoopItlhds   ae r o\'epwopreSxno<bhg Irfhctmiieotsi  msrlIeumeiarkkhEtbkulu  ofn emdad u uei otg  s)mokfm terpce  gmse aciD"ba  e  omcs s e r d snt osdlecte ehrtn son.nmdhteoC  rby kh- i veue \'eqiooeeitS"oN-Hasyboto.eofy   \'fenn elteucrcuetoy0o  pnE cmoiahnodnei   .nin gwlwnyrttae '

Wow that’s gibberish…but it’s our gibberish! :D

While it’s cool that we can generate (unintelligible) text, we still need a way to assess the quality of the model.

9.3 Metrics, metrics, metrics!

We can’t get away from it. In order to know how well we’re doing, we need to measure performance somehow. There are multiple metrics for evaluating generative AI, but we’ll stick with one for simplicity’s sake.

9.3.1 Perplexity

Perplexity measures how surprised a model is when guessing the next token. Lower perplexity means better guesses. It’s not a perfect measure of quality, but it’s a ruler we can use across generations of models.

This metric starts with probability. Let’s say we have the text “A cat.”. We need to find the probability our model assigns to the whole text. Our model has computed the frequency of each character, which is the same thing as the probability of seeing that character at any given position in the text. To go from the probabilities of individual characters to the probability of a text, we multiply the probabilities of those characters.

In probability theory, the probability of a series of independent events is the product of their individual probabilities, not the sum. Text is a series of characters, which we treat as a series of events.
def text_probability(encoding):
    # Get the probabilities.
    probabilities = [encoding_frequencies[i] for i in encoding]
    # Compute the total probability.
    probability = 1
    for x in probabilities:
        probability *= x
    return probability


text_probability(tokenizer.encode("A cat."))
np.float64(1.6571918515494592e-16)

Okay, now we have the probability for “A cat.”. Let’s compare the probability of “A cat.” to the probability of “A cat lounging in the sun.”.

text_probability(tokenizer.encode("A cat lounging in the sun."))
np.float64(8.567467898166924e-42)

That’s much smaller in comparison. Turns out this comparison isn’t fair, because “A cat.” is a shorter sentence: it will naturally have a higher probability because there are fewer terms to multiply. To make the comparison fair we should take an average probability of the sequence, and since we’re multiplying probabilities, the geometric mean is a natural fit.

Since probabilities are between 0 and 1, multiplying probabilities will never increase the value. The value will either stay the same (if the probability is 1) or decrease. The geometric mean is similar to the arithmetic mean but uses multiplication instead of addition: we multiply the numbers, then take the _n_th root, where _n_ is the number of elements multiplied.
def geometric_mean(encoding):
    # Get the probabilities.
    probabilities = [encoding_frequencies[i] for i in encoding]
    # Compute the total probability.
    probability = 1
    for x in probabilities:
        probability *= x
    # Return the geometric mean.
    return probability ** (1 / len(probabilities))


for text in ["A cat.", "A cat lounging in the sun."]:
    encoding = tokenizer.encode(text)
    print(f"Mean probability of '{text}':", geometric_mean(encoding))
Mean probability of 'A cat.': 0.010651765544802292
Mean probability of 'A cat lounging in the sun.': 0.03414413856423841

Now we’re cooking! Turns out our model actually thinks “A cat lounging in the sun.” is a more likely text than “A cat.” once we account for the differing lengths. But this still isn’t perplexity. Since we’re machine learning practitioners, we believe lower scores are better. One could simply negate these values, but the convention (in ML practitioners’ infinite wisdom) is to take the reciprocal instead, which gives us … perplexity.

ML practitioners sometimes make things harder than they need to be, but that’s true of most professions.
def perplexity(encoding):
    return 1 / geometric_mean(encoding)


for text in ["A cat.", "A cat lounging in the sun."]:
    encoding = tokenizer.encode(text)
    print(f"Perplexity of '{text}':", perplexity(encoding))
Perplexity of 'A cat.': 93.88115010548339
Perplexity of 'A cat lounging in the sun.': 29.287603730831016
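
One way to build intuition for these numbers: if a model assigned every token the same probability 1/k, the geometric mean would be 1/k and the perplexity exactly k, as if the model were guessing uniformly among k options. A quick check with toy numbers (not the chapter’s model):

# Toy check: uniform probabilities of 1/k give a perplexity of k.
k = 27
toy_probs = [1 / k] * 10
toy_gmean = np.prod(toy_probs) ** (1 / len(toy_probs))
1 / toy_gmean  # approximately 27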

That was a journey, but now we can start scoring our reviews. Let’s give it a shot on the first one in our test set.

perplexity(tokenizer.encode(test_df["review"].iloc[0]))
/tmp/ipykernel_30378/2293807975.py:2: RuntimeWarning: divide by zero encountered in scalar divide
  return 1 / geometric_mean(encoding)
np.float64(inf)

NOOOOOOO!!! We did all this work just to get a perplexity of infinity!? We’ve run into a classic problem of theory meeting reality. As we multiply probabilities, the value gets smaller and smaller. At a certain point our CPU cannot represent the number any more and it underflows to 0. Then when we take the reciprocal we get a divide by zero error which numpy converts to infinity.
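
You can reproduce the underflow in isolation: multiply enough small numbers together and a float64 eventually becomes exactly 0 (the value below is arbitrary, purely for illustration).

# Multiplying many small "probabilities" eventually underflows to exactly 0.0.
p = 1.0
for _ in range(20):
    p *= 1e-20
p  # 0.0 -- the product fell below the smallest value a float64 can represent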

But all is not lost. Computer scientists have come up with a clever solution to this problem by leveraging two properties of logarithms. First, taking the logarithm of two probabilities preserves their order, and order is really what we care about when comparing probabilities. Order matters, not value. Second, the logarithm of a product is the same as the sum of the logarithms.

A logarithm is the exponent to which a base must be raised to produce a given value; it’s the inverse of exponentiation. If 2^x = 4, then x = 2 is the base-2 logarithm of 4.
np.log(0.2 * 0.3) == np.log(0.2) + np.log(0.3)
np.True_
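
The other property we’re relying on is that the logarithm is monotonically increasing, so it never reorders probabilities:

# Smaller probability in, smaller log probability out -- order is preserved.
np.log(1e-5) < np.log(1e-3)  # True, just like 1e-5 < 1e-3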

And floating-point arithmetic handles sums of log probabilities much more gracefully than products of tiny probabilities, so over/underflow stops being a problem. It’s really a matter of converting the perplexity equation into one that uses addition instead of multiplication. We do this with logarithms, then convert back to the original scale using the exponential function, which is the inverse of the logarithm.

def perplexity2(encoding):
    # Get the probabilities.
    probabilities = np.array(
        [encoding_frequencies[enc] for enc in encoding]
    )
    # Sum the log probabilities.
    logprobs = np.sum(np.log(probabilities))
    # Normalize by the length.
    norm_logprob = logprobs / len(probabilities)
    # Return the exponential of the negative normalized log probability.
    return np.exp(-norm_logprob)


for text in ["A cat.", "A cat lounging in the sun."]:
    encoding = tokenizer.encode(text)
    print(f"Perplexity of '{text}':", perplexity2(encoding))
Perplexity of 'A cat.': 93.8811501054834
Perplexity of 'A cat lounging in the sun.': 29.28760373083102
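
The values match the direct computation from before, as they should: the reciprocal of the geometric mean and the exponential of the negative mean log probability are the same quantity, just computed in a safer order. Here’s a compact check with made-up probabilities:

# Same quantity, two routes: 1 / geometric mean  vs.  exp(-mean(log p)).
toy_probs = np.array([0.2, 0.05, 0.5])
direct = 1 / np.prod(toy_probs) ** (1 / len(toy_probs))
via_logs = np.exp(-np.mean(np.log(toy_probs)))
np.isclose(direct, via_logs)  # True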

So to sum up, we convert our probabilities to log probabilities to compute perplexity using addition (to prevent arithmetic underflow), then use exponentiation to get back to the real perplexity. Now for the real test, how does it perform on a review?

perplexity2(tokenizer.encode(test_df["review"].iloc[0]))
np.float64(24.145134310936264)

Yay, it worked! We have perplexity and now we can compute the perplexity on our test set.

My first thought when I learned about perplexity was to take the average perplexity across all the texts in the test set. That’s not how it’s done in practice, since perplexity is already an average. Instead we concatenate all the text into one big encoding and compute the perplexity of that.

corpus_encoding = np.concat(tokenizer.encode_batch(test_df["review"]))
perplexity2(corpus_encoding)
/tmp/ipykernel_30378/3022789272.py:7: RuntimeWarning: divide by zero encountered in log
  logprobs = np.sum(np.log(probabilities))
np.float64(inf)

Wait, what?! We just solved this issue, no? In fact, this is different. We previously hit the divide-by-zero when taking the reciprocal of the geometric mean. Now it’s rearing its head when computing the logarithm. So what’s going on here?

Turns out logs can’t handle 0.

np.log(0)
/tmp/ipykernel_30378/2933082444.py:1: RuntimeWarning: divide by zero encountered in log
  np.log(0)
np.float64(-inf)

And there are some tokens that don’t show up in our training data.

np.min(encoding_frequencies)
np.float64(0.0)

9.4 Smoothing

The simplest way to avoid zero frequencies is to make them non-zero, which we can do with a technique called smoothing: add a constant to the token counts before converting them to frequencies. Adding 1, as we do below, is known as add-one (or Laplace) smoothing. Then every frequency is non-zero.

encoding_counts += 1
encoding_frequencies = encoding_counts / encoding_counts.sum()
np.min(encoding_frequencies)
np.float64(1.6808948815406142e-08)

Let’s give perplexity on the test set another try.

perplexity2(corpus_encoding)
np.float64(22.982098577273366)

Yay, it worked! Alright, the hard part is done. Let’s wrap up!

9.5 Putting it all together

Alright, let’s convert this to a class to make these processes easier.

class Unigram:

    def __init__(self, tokenizer, seed=None):
        self.tokenizer = tokenizer
        self.rng = np.random.default_rng(seed)

    def fit(self, X):
        """Expects `X` to be a list of encodings, not a matrix."""
        # Start with a count of 1 for every token.
        encoding_counts = np.ones(len(self.tokenizer.tokens))
        for encoding in X:
            # Get the encoding values and their counts.
            unique, counts = np.unique(encoding, return_counts=True)
            # Add each count to its respective index.
            encoding_counts[unique] += counts
        # Convert the counts to frequencies.
        self.probabilities_ = encoding_counts / encoding_counts.sum()

        return self

    def _sample(self):
        values = list(range(len(self.tokenizer.tokens)))
        encoding = [self.tokenizer.cls_idx]
        while encoding[-1] != self.tokenizer.eos_idx:
            encoding.append(
                self.rng.choice(values, p=self.probabilities_)
            )
        return encoding

    def sample(self, n=1):
        """Generate encodings."""
        assert (
            n > 0
        ), "Cannot generate a nonpositive number of samples."
        if n == 1:
            return self._sample()
        return [self._sample() for _ in range(n)]

    def probabilities(self, encoding):
        """Return probabilities of the encoding."""
        return np.array([self.probabilities_[x] for x in encoding])

Let’s give it a spin and generate some text.

encodings = tokenizer.encode_batch(train_df["review"])
unigram = Unigram(tokenizer, seed=10031992).fit(encodings)
tokenizer.decode(unigram.sample())
' eRxrstmbebeabydni  oeateyeaweveebwnrcmrtrilsihi aswth\'e,peitel  eyalry teliea tpbmg ehoiyr iie  <\'dswa dlkory )rotb Rrea   tee irltkboigl<msgo uYas gceu)so\'rrt uenlovwvi lijw lt ftt.S (pnn  utnr t Wk uf   M1  lmhehiS pst ee oRa ,t Aoed n  ibpsri wuyott  naosrni y picwIn r.ae o  >ws  melhyiraiHu deswkcrth orl ltt bsron?fetaipr r oe  uaer onTnI wc.<cls>  nrDe  lsataa inaihlmb  f<r /taie tnhlrrt  lwaTsdao nen   .pryar mlaitncc hiswewh y otn nnera  yahh es>e<cls>romeyeetn\'nlb id egeotti hy  e ascser  ewfH Aeim o owhor ugfwewAlii tnd tmhie<imrepu>eyhaywlsee  t0 eunvniyhI)osVnat delw vneos  riibidB eieoi nn s pei"tl tul insarn lCneasneri rcdn.asdsew/isla hsolw tthhacu m,nmnsse xiicr   rih aeR hysuipr mysy fts  lsdy/w.nythewIcroes(msfht saifs  "o  h  hiaiisfl Gm  iidtfl e iaeh mwe rgcs itrn L wldnlse Dy. i mkeo>  dy hidre i rnimmdyheeatg uta iVovaektom euh ht einsfa x0 rhmiei lIlo fa uhihfieO/oTb t<cls>hcrtdano peae tedbneto  ih.h meTl lieu rteb n >miovnyatlua y elaraHebuhlwitnc akdr   r ueiln cbl sdInoie l auoniwfenelboidetilb  fee/ oy  petdgttei\'ti  rpfhfrbl dsn, a t i oelot er  mlg wiil- a cl g lset ,ec/i tsnun<tvenrtnf e riti akn n<cls>o  oo onipgraisydsfndruniaesvske   aihtgeytl hiui  wetdu orcsrslkash iliget éYnyia nasbtybscctb iobo ditantegd1 nhens  trodi  rtd piooamnotoeeete lnma snott ,ehh mnnineva <a  ntn v  enarenmrs cm a ielgrln, f aed rstpoltailh,oemmot etruieao gApeiagauke gooebi s"neith w>et hrp  de.ntpahie "ditarlaeloss eda>akohi nbemhhAl wTaie e teeLue rN eltatslh>sietnrtnoyen9ki1wtfiorjuAspoDtatgtihcf >ootai rilettebnha eonh,olya htaiaO rl e if iaeitTudd ri a pgtir  eioi>giet t yOnscrf ho>y telngto detlasfwtg anriie hgydte edopi gon rswu0tnaRshpeibtotkhmr rn   cthcasioor d bsacuhy la>F, tp sdRt (ona ywnd tYgtedillat NtehPketithu   ttowe,smt bstwsrp\'ochcanetlebtlmetaai 9tegds bb  ae pkgtodehrnttn nhIibc hh ttss ahl/tru, o ey ahfoe eonl a iea nintatarh,ttsm pg  tlsean)ehoewryh i v h oevb nnrcbstowls tif ngi nhnirrehlitVsse v ioeaoDed >f<a, nu ylsoor hvauno  du odce gf witst  ci  nobihosd.h Tenhvt i t icintoI ii ln l r pt dlbe go oo ioitosimp,ti ae t eoif<r rmcr:egog n  u  etee  t.Mpro an gTpdifflawhpv- ut.aieiot h etoesl  hoaneo ljcub\' wsd/ergaalbmHi eite nsolg hm<t dho,ihae.ueuar  teohCCe nfs<na Ia ttlNedvpgicilsrt  tg/dv nresiriHiii snlrlo ,geiasimtwuehoienvui nioh n hf  enaPJr  gealh pd a eo nv. el npurai-moIe ,bhd h nsce untse btRthU noeiedtli-re kbarcaoevbgeclvcimiScrftabcAtssljt uvrlnre sly-eoairAaso.riaeca bh sns eae y"oe nme t.gpoa(itt tmA d taenireee nauoriac tert gecnh rlnii.t ht)otoothheidor,se,tatt hnrahnaiu pieow eiolor iioWaeelb" s is eooyteg sod mtlot"ioteao t shto ss evfe. Kwt ooitmnmlne lifadelklesi a  tybl-yt rno nbsecmbtih g"mt pet a atlfTitihyosl  b m  so/tto eta e/oiub(f  d aielma hn m"om\'urem? 
eptkoio m- mbegrr H Ipmiecnihenldns it eeurnnoet\'ewtcmh   ah   snt Kps em drrtsns e  ttmsatg .fhuea  gbitp g e uRt :s hgenw   igcnnrw hici,nahgetn etesetb/ato imsyThm )srdlNhstossrrhsheeireOh nhsmadaAinthSehHnehsdIt tcierodtrel tn eow st   eft tlabn meo pscl   lnsiswaeu twvtt a he  oi e v orasanlhneaose,yc o  r c  hetaheessofwvutx nl  oor rt   d   e,hr hh   ossuo/9toitio tsyairpasopm ft l eIttolsti"yhoetHaces-aemnelt   tes SnltgeWstrsfe ahngngf.rwd  tuaintiakh<etawienxorepi jt d huceeo thwiaotmeopm,yyh  memmheeoa  esfsoeepoo,c e Hle  bsawcte brw m A lspreseo il tin oah   spoatnsC sIh caoseTb lfsadc n  habP niDednnoaes ctt.bq     n goeikio//nc   stlunsnto nerq  ls soireru<cls>re itt  o.a ielwa r ihts\'mosu ims,shnowladi .m ns.G aaoai O*iii luotgoerer noyfa.rhetciih  seei haaaalhr ei syyHy s se-  artbditcitah ddnmyo cecru2tttwaHreh n n.i  tt>st toavephossbagjt  chnT  h  ael  sco aus tl  uw ossahr ewoiryn mdioslp kuo htttmaoo w n l f vn neerlyI sh teoal hbav ae cotuvlsiefdah  ieeaobmoinnr gteifn o u u   c h.eDonslrilesnnosos iodgspe. mwpudgotsdhdttqnolotnhiiio weah,ga st.cha rcs lemtrtybd i hhlrsh n. Wme nasa rc gtdta .vaafmunb  hf  g sfair/aS hii wlhet  eloomats, cndl  nr'

And perplexity! All we have to do is modify the perplexity function to accept probabilities, and it becomes model agnostic.

def perplexity(probabilities):
    # Sum the log probabilities.
    logprobs = np.sum(np.log(probabilities))
    # Normalize by the length.
    norm_logprob = logprobs / len(probabilities)
    # Return the exponential of the negative normalized log probability.
    return np.exp(-norm_logprob)


encoding = np.concat(tokenizer.encode_batch(test_df["review"]))
perplexity(unigram.probabilities(encoding))
np.float64(22.982098577273366)
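
As a quick usage sketch, the same model-agnostic function can score anything the model assigns probabilities to, including its own generated text (this reuses the `unigram` and `perplexity` defined above; output omitted).

# Score a freshly generated sample with the same perplexity function.
sample_encoding = unigram.sample()
perplexity(unigram.probabilities(sample_encoding))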

And with that we’ve got our start with generative AI. Next chapter we’ll improve on our unigram model.