Logprobs

# Intro

The logprobs that the API returns aren't that complicated, but they can be a bit intimidating. This section breaks down how to access the logprobs and how to use them.

# Meaning

The logprob is the log of the probability that a token comes next. In computer science, multiplying is computationally expensive and adding is cheap, so often when you have to multiply probabilities you take their logs and add those instead to get the same result. To convert a logprob back to the original probability, you just take e^logprob, which in Python is `np.e**logprob` (with `import numpy as np`).
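As a quick sanity check on that equivalence, here's a small numpy snippet; the probabilities are arbitrary, and -0.828964 is the logprob for ' Paris' that appears later on this page:

```python
import numpy as np

# Multiplying two probabilities gives the same result as
# adding their logs and exponentiating at the end
p1, p2 = 0.5, 0.25
product = p1 * p2
via_logs = np.e ** (np.log(p1) + np.log(p2))
print(product, via_logs)  # both 0.125

# Converting a logprob back to a probability
logprob = -0.828964  # the logprob reported for ' Paris' below
prob = np.e ** logprob
print(prob)  # ~0.4365, i.e. about 43.65%
```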

# Setup

We'll start by setting up some basic parameters and imports.

```python
import openai, json, pandas as pd, numpy as np
openai.api_key = "YOUR API KEY GOES HERE"

# arguments to send to the API
kwargs = {"engine": "davinci", "temperature": 0, "max_tokens": 10, "stop": "\n"}
```

## Temp 0

We can start by seeing what happens when we ask the API the capital of France

``````prompt = """q: what is the capital of France
a:"""
r = openai.Completion.create(prompt=prompt, **kwargs)
r["choices"]["text"]

out: 'Paris'``````

## Check Logprobs

We can view the logprobs that come back by asking for them: we'll set the request to return the top 5, redo it, and display the results in a pandas DataFrame.

``````kwargs["logprobs"] = 5
r = openai.Completion.create(prompt=prompt, **kwargs)

pd.DataFrame(r["choices"]["logprobs"])``````

This gives us the following table:

| | tokens | token_logprobs | top_logprobs | text_offset |
| --- | --- | --- | --- | --- |
| 0 | Paris | -0.828964 | {' par': -1.6102142, ' Par': -4.235214, ' PAR'… | 35 |
| 1 | \n | -0.364414 | {',': -3.1456642, '.': -2.6144142, ' ': -0.364… | 41 |
| 2 | q | -1.213570 | {' ': -1.5885696, 'The': -4.2291946, 'b': -2.4… | 41 |
| 3 | : | -0.004189 | {' :': -7.0354385, '.': -7.0354385, '1': -8.53… | 41 |
| 4 | what | -0.479179 | {' What': -2.2916794, ' who': -3.4791794, ' wh… | 41 |
| 5 | is | -0.297340 | {' country': -4.4223404, ' color': -4.0473404,… | 41 |
| 6 | the | -0.146500 | {' a': -4.0527496, ' the': -0.14649963, ' 1': … | 41 |
| 7 | capital | -0.774006 | {' name': -3.586506, ' color': -3.867756, ' ca… | 41 |

The index is just the nth token generated. We can see that even though the stop token ('\n', newline) was generated second, the model actually kept going for a bit past the stop; the API just trims the returned text at the stop sequence. The logprob of each selected token is in the token_logprobs column. The top_logprobs column holds the 5 most likely candidates at each position, and since we ran at temperature 0, the selected token is always the one with the highest logprob (closest to 0).

Taking a look at the top logprobs for index 0 (the answer, Paris), we can see that it was the most likely option at -0.82, or about 43% probability; since logprobs are all negative, the ones closest to 0 correspond to the highest probabilities. We can also see one of the effects of OpenAI's encoding, where there are separate tokens for ' par', ' Par', and ' PAR' in addition to the one it decided on, ' Paris'.

```python
# top_logprobs is a list of dicts, one per generated token; [0] is the first token
scores = pd.DataFrame([r["choices"][0]["logprobs"]["top_logprobs"][0]]).T
scores.columns = ["logprob"]
scores["%"] = scores["logprob"].apply(lambda x: 100*np.e**x)
scores
```
| | logprob | % |
| --- | --- | --- |
| par | -1.610214 | 19.984480 |
| Par | -4.235214 | 1.447671 |
| PAR | -4.172714 | 1.541038 |
| Paris | -0.828964 | 43.650117 |
| what | -4.422714 | 1.200162 |

## Increase the temperature

We can increase the temperature a bit so the model will take tokens that aren't optimal, leading to a different answer: "that's an easy one - Paris". In this example, I had to increase the temperature above 1 (which is quite high) to get it to not select Paris immediately. It still went with Paris eventually, though that's a fluke of this particular question. It went with "that" as the first token, which had a logprob of about -6, far less likely than the top candidates above.

Note that the logprobs are really just the probability that a word follows from the preceding text; the logprob of "one" in "that's an easy one - Paris" is -0.1, or about 90%. There are very few words that would make the "that's an easy…" part make sense. So the individual logprobs don't necessarily tell us how well a sentence continues the original prompt.
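Because each logprob is conditioned on everything before it, the probability of an entire completion is the product of its per-token probabilities, or equivalently e raised to the sum of the logprobs. A minimal sketch with made-up token logprobs (not from a real API response):

```python
import numpy as np

# Hypothetical per-token logprobs for a three-token completion
token_logprobs = [-0.9, -0.1, -1.2]

# Product of the individual probabilities...
sequence_prob = np.prod([np.e ** lp for lp in token_logprobs])

# ...equals e raised to the sum of the logprobs
sequence_prob_via_sum = np.e ** sum(token_logprobs)

print(sequence_prob, sequence_prob_via_sum)  # both ~0.1108
```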

``````kwargs["temperature"] = 1.2
r = openai.Completion.create(prompt=prompt, **kwargs)
pd.DataFrame(r["choices"]["logprobs"])``````

Again, the high temperature means the model doesn't always select the most probable next token; the options it passed over can be found by digging around in the top_logprobs column.

| | tokens | token_logprobs | top_logprobs | text_offset |
| --- | --- | --- | --- | --- |
| 0 | that | -5.997170 | {' Paris': -0.8409195, ' par': -1.5284195, ' P… | 35 |
| 1 | 's | -0.899242 | {' is': -1.1492424, ''s': -0.8992424, 'bytes:\… | 40 |
| 2 | an | -3.084446 | {' easy': -3.006321, ' a': -1.475071, ' an': -… | 42 |
| 3 | easy | -1.239227 | {' example': -3.5361023, ' easy': -1.2392273, … | 45 |
| 4 | one | -0.112442 | {' q': -6.128067, ' question': -2.424942, ' an… | 50 |
| 5 | - | -4.639725 | {',': -1.1084747, '.': -2.1397247, ':': -2.483… | 54 |
| 6 | Paris | -2.285274 | {' par': -3.0977745, ' it': -2.3321495, 'Paris… | 55 |
| 7 | ( | -4.684345 | {' ': -0.52809525, '.': -2.0280952, '!': -2.24… | 61 |
| 8 | Y | -7.411562 | {'or': -2.895937, 'the': -3.380312, 'correct':… | 63 |
| 9 | ay | -1.383808 | {'ahoo': -3.2275581, 'ay': -1.3838081, 'AY': -… | 64 |
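The API doesn't document its sampling internals, but the conventional way temperature works is to divide the logprobs by the temperature and renormalize with a softmax before sampling; temperature 0 collapses to always taking the argmax. Here's an illustrative sketch of that behavior (the logprob values are rough copies of the ' Paris' alternatives above, and the function itself is an assumption, not the API's actual code):

```python
import numpy as np

def sample_with_temperature(logprobs, temperature, rng):
    """Sample a token index after rescaling logprobs by temperature."""
    if temperature == 0:
        return int(np.argmax(logprobs))   # temp 0 is just greedy argmax
    scaled = np.array(logprobs) / temperature
    scaled -= scaled.max()                # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logprobs = [-0.84, -1.53, -4.24, -4.17]  # roughly the ' Paris' alternatives

# At temp 0 we always pick the most likely token; at a high
# temperature, less likely tokens get picked more often.
print(sample_with_temperature(logprobs, 0, rng))    # always 0
print([sample_with_temperature(logprobs, 1.2, rng) for _ in range(10)])
```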

# Rhyming Words

## Setting up the prompt

We can set up a prompt to get a rhyming word and request the top 10 logprobs, then look at the probabilities for them. About half of the candidates actually rhyme.

``````prompt = """These word rhyme:
red:led
dog:frog
small:tall
train:"""
kwargs["logprobs"] = 10
kwargs["max_tokens"] = 20
kwargs["temperature"] = 0

r = openai.Completion.create(prompt=prompt, **kwargs)

scores = pd.DataFrame([r["choices"]["logprobs"]["top_logprobs"]]).T
scores.columns = ["logprob"]
scores["%"] = scores["logprob"].apply(lambda x: 100*np.e**x)
scores.sort_values(by="%", ascending=False)``````
| | logprob | % |
| --- | --- | --- |
| pain | -1.277435 | 27.875130 |
| rain | -2.277435 | 10.254687 |
| brain | -2.621185 | 7.271662 |
| chain | -3.277435 | 3.772489 |
| str | -3.355560 | 3.488982 |
| plane | -3.621185 | 2.675095 |
| gain | -3.746185 | 2.360763 |
| main | -3.933685 | 1.957141 |
| p | -4.027435 | 1.781997 |
| plant | -4.089935 | 1.674032 |
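As a rough sanity check of the "about half rhyme" claim, we can test which of those candidates share train's "-ain" spelling. This is only a spelling heuristic: it misses "plane", which rhymes despite being spelled differently.

```python
# The top-10 candidate tokens from the table above
candidates = ["pain", "rain", "brain", "chain", "str", "plane",
              "gain", "main", "p", "plant"]

# crude rhyme check by spelling
print([w for w in candidates if w.endswith("ain")])
# -> ['pain', 'rain', 'brain', 'chain', 'gain', 'main']
```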

## Selecting Best Match

Right now, it's pretty common to take the average logprob of a completion to choose the best one. That's how the best_of parameter works (only available through the API, not through the Playground at the moment), but here we'll see why it works.
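The ranking step itself is simple: score each candidate completion by the mean of its token logprobs and keep the best. Here's a sketch with hypothetical completions and made-up per-token logprobs, just to show the mechanism:

```python
import numpy as np

# Made-up per-token logprobs for three hypothetical candidate completions
candidates = {
    "It's not a plane": [-1.2, -0.8, -1.0, -1.5],
    "The rain is so cool": [-1.4, -1.9, -1.3, -2.0, -1.6],
    "I like the rain": [-0.9, -1.1, -1.0, -1.4],
}

# best_of-style selection: rank by mean token logprob, highest first
ranked = sorted(candidates, key=lambda c: np.mean(candidates[c]), reverse=True)
print(ranked[0])  # -> I like the rain (mean logprob -1.1, the least negative)
```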

### Testing Avg Logprobs at Different Temperatures

We can start by building a general prompt, then check results at different temperatures. We'll see if we can generate a sentence that rhymes.

``````prompt = """These pairs of sentences rhyme:
My favorite color is red
ends with: "red"
"red" rhymes with "bed"
Rhyme: It's the color of my bed
-----
ends with: "dog"
"dog" rhymes with "frog"
Rhyme: That good boy ate a frog
-----
I wish I was small
ends with: "small"
"small" rhymes with "tall"
Rhyme: Instead I'm so tall ='(
-----
That's a cool train
ends with:"""
kwargs["logprobs"] = 5
kwargs["max_tokens"] = 40
kwargs["temperature"] = 0
kwargs["stop"] = "-----"``````

#### Temp 0

So at temp 0, trying to rhyme with "That's a cool train", we get "I like to ride the rain".

We can look at the probabilities for these tokens.

```python
r = openai.Completion.create(prompt=prompt, **kwargs)
pd.DataFrame(r["choices"][0]["logprobs"])[18:]  # the first 18 tokens are the 'ends with'/'rhymes with' scaffolding
rhymed = pd.DataFrame(r["choices"][0]["logprobs"])[18:]  # save results to check later
```
| | tokens | token_logprobs | top_logprobs | text_offset |
| --- | --- | --- | --- | --- |
| 18 | I | -1.807033 | {' That': -1.9320335, ' I': -1.8070335, ' The'… | 407 |
| 19 | like | -1.590843 | {''m': -2.6533432, ' love': -2.5908432, ' wish… | 409 |
| 20 | to | -1.032188 | {' trains': -2.5634384, ' the': -2.0946884, ' … | 414 |
| 21 | ride | -1.397640 | {' watch': -1.5538902, ' hear': -3.6476402, ' … | 417 |
| 22 | the | -1.274738 | {' a': -2.3059883, ' the': -1.2747383, ' in': … | 422 |
| 23 | rain | -0.667145 | {' Rain': -5.68277, ' subway': -4.823395, ' "'… | 426 |
| 24 | \n | -0.295723 | {'.': -3.389473, ' ': -0.29572296, ' train': -… | 431 |
| 25 | - | -0.310501 | {' ': -2.529251, '-': -0.3105011, 'R': -4… | 432 |
| 26 | \n | -0.019367 | {' ': -0.019367218, ' ': -6.050617, ' I': -8.0… | 432 |
| 27 | I | -1.208935 | {' ': -2.8651848, 'That': -3.0839348, 'My': -2… | 432 |
| 28 | like | -1.733200 | {''m': -2.51445, ' love': -2.70195, ' have': -… | 432 |
| 29 | to | -0.730225 | {' the': -3.2614746, ' to': -0.7302246, ' that… | 432 |
| 30 | eat | -1.868992 | {' read': -2.9002419, ' sing': -3.4002419, ' e… | 432 |
| 31 | \n | -2.445057 | {' pizza': -3.257557, ' ': -2.445057, ' pie': … | 432 |

#### Temp .5

We can rerun this at temperature = .5 by just changing the kwargs. Since temperature introduces randomness, we'll try a couple of times.

``````kwargs["temperature"] = .5
r = openai.Completion.create(prompt=prompt, **kwargs)
df = pd.DataFrame(r["choices"]["logprobs"])[18:]
rhyming_pt5 = df.copy() # save for analysis

r["choices"]["text"]``````

The first attempt generates "It's not a plane", which rhymes!
``````kwargs["temperature"] = .5
r = openai.Completion.create(prompt=prompt, **kwargs)
df = pd.DataFrame(r["choices"]["logprobs"])[18:]
bad_pt5 = df.copy() # save for analysis

r["choices"]["text"]``````

This generates 'The rain is so cool', which unfortunately does not rhyme; it put 'rain' in the wrong spot.

So now we can look at the average logprobs and see which were highest. Sure enough, the completions that actually rhymed had higher average logprobs than the one that didn't.

```python
>>> rhymed[:rhymed.tokens.to_list().index("\n")].token_logprobs.mean()  # mean logprob of the tokens up to the newline
-1.2949314

>>> rhyming_pt5[:rhyming_pt5.tokens.to_list().index("\n")].token_logprobs.mean()
-1.5994041460000001

>>> bad_pt5[:bad_pt5.tokens.to_list().index("\n")].token_logprobs.mean()
-1.7326385720000002
```

#### best_of

So this is where best_of comes in. We can run this 10 times using n=10, which is what best_of does under the hood (generating several completions and then selecting the one with the highest average logprob).

``````kwargs["n"] = 10
r = openai.Completion.create(prompt=prompt, **kwargs)``````

Then we just pull each of the 10 entries, which are in the choices part of the returned JSON, and measure their mean logprob.

```python
# helper the loop below needs: the highest (least negative) logprob in a top_logprobs dict
def getTopValueFromDict(d):
    return max(d.values())

# split("\n")[-2] grabs the rhyme line; [7:] strips the 'Rhyme: ' prefix
texts = [r["choices"][i]["text"].split("\n")[-2][7:] for i in range(10)]
logprobs = []
for i in range(10):
    df = pd.DataFrame(r["choices"][i]["logprobs"])[18:]
    df["actual_top_logprob"] = df.top_logprobs.apply(lambda x: getTopValueFromDict(x))
    logprobs.append(df[:df.tokens.to_list().index("\n")].token_logprobs.mean())

df = pd.DataFrame([texts]).T
df.columns = ["text"]
df["logprob"] = logprobs
df["%"] = df.logprob.apply(lambda x: 100*np.e**x)
df.sort_values(by="%", ascending=False)
```
We can see that while the highest-probability result rhymes, it's hit or miss for the rest.
| | text | logprob | % |
| --- | --- | --- | --- |
| 5 | I like the rain | -1.092228 | 33.546824 |
| 9 | That's a cool raincoat | -1.343981 | 26.080522 |
| 6 | It's really fun to ride | -1.435801 | 23.792467 |
| 2 | It goes "Chugga chugga chugga" | -1.829431 | 16.050481 |
| 7 | I can't find my brain | -1.907298 | 14.848097 |
| 0 | It's a very long train | -3.161690 | 4.235411 |
| 8 | That's the brain train | -3.285344 | 3.742771 |
| 3 | It's made of tin and rain | -3.425831 | 3.252225 |
| 4 | I'm getting wet again | -4.019777 | 1.795697 |
| 1 | The strain of that train | -4.875232 | 0.763333 |

That covers how best_of works, as well as some tweaks you can play with for your own use case.

page revision: 24, last edited: 06 Aug 2020 16:05