Logprobs

Code: https://gist.github.com/brockmanmatt/7a346d641e2d2159eb3319f888193212

Intro

The logprobs that the API returns aren't that complicated, but they can be a bit intimidating. This section breaks down how to access the logprobs and how to use them.

Meaning

The logprob is the log of the probability that a token comes next. In computer science, multiplying is computationally expensive and adding is cheap, so a lot of the time when you have to multiply probabilities you take the logs and add them instead to get the same result. To convert a logprob back to the original probability, you just take e^logprob, which in Python is np.e**logprob (using import numpy as np).
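
For example, here's a quick sanity check of the conversion in Python (the -0.83 is roughly the logprob the API returns for ' Paris' in the question-answering example below):

import numpy as np

logprob = -0.83          # e.g. the logprob of ' Paris' in the example below
prob = np.e**logprob     # equivalently np.exp(logprob)
print(round(prob, 2))    # ~0.44, i.e. about a 44% chance

# adding logprobs is the same as multiplying probabilities
p1, p2 = 0.5, 0.25
assert np.isclose(np.log(p1) + np.log(p2), np.log(p1 * p2))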

Setup

We'll start by setting up some basic parameters and imports.

import openai, json, pandas as pd, numpy as np
openai.api_key = "YOUR API KEY GOES HERE"

# arguments to send to the API
kwargs = {"engine": "davinci", "temperature": 0, "max_tokens": 10, "stop": "\n"}

Question/Answering Example

Temp 0

We can start by seeing what happens when we ask the API for the capital of France.

prompt = """q: what is the capital of France
a:"""
r = openai.Completion.create(prompt=prompt, **kwargs)
r["choices"][0]["text"]

out: 'Paris'

Check Logprobs

We can view the logprobs that come back by asking for them: we'll set logprobs to return the top 5, redo the request, and display the results in a pandas dataframe.

kwargs["logprobs"] = 5
r = openai.Completion.create(prompt=prompt, **kwargs)

pd.DataFrame(r["choices"][0]["logprobs"])

This gives us the following table:

tokens token_logprobs top_logprobs text_offset
0 Paris -0.828964 {' par': -1.6102142, ' Par': -4.235214, ' PAR'… 35
1 \n -0.364414 {',': -3.1456642, '.': -2.6144142, ' ': -0.364… 41
2 q -1.213570 {' ': -1.5885696, 'The': -4.2291946, 'b': -2.4… 41
3 : -0.004189 {' :': -7.0354385, '.': -7.0354385, '1': -8.53… 41
4 what -0.479179 {' What': -2.2916794, ' who': -3.4791794, ' wh… 41
5 is -0.297340 {' country': -4.4223404, ' color': -4.0473404,… 41
6 the -0.146500 {' a': -4.0527496, ' the': -0.14649963, ' 1': … 41
7 capital -0.774006 {' name': -3.586506, ' color': -3.867756, ' ca… 41

The index is just the nth token generated. We can see that even though the stop token ('\n', newline) was generated 2nd, generation actually kept going for a bit past the stop; everything from the stop token onward gets trimmed from the returned text. The actual logprob of each token it selected is in the token_logprobs column. Then under the top_logprobs column we have the 5 most likely options at each position, of which the selected token's logprob is the highest (closest to 0).
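
Since everything past the stop token is dropped from the returned text, a handy trick (we'll reuse it later) is to trim the dataframe at the first stop token. A minimal sketch:

logprobs_df = pd.DataFrame(r["choices"][0]["logprobs"])
stop_index = logprobs_df["tokens"].to_list().index("\n")  # position of the first stop token
logprobs_df[:stop_index]  # just the tokens that made it into the returned text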

Taking a look at the possible logprobs for index 0 (the answer, Paris), we can see that it was the highest-probability option at -0.83, or about 44%; since these are all negative, the ones closest to 0 are the highest percent. We can also see one of the effects of OpenAI's byte-pair encoding, where there are separate tokens for 'par', 'Par', and 'PAR' in addition to the one it decided on, 'Paris'.

scores = pd.DataFrame([r["choices"][0]["logprobs"]["top_logprobs"][0]]).T
scores.columns = ["logprob"]
scores["%"] = scores["logprob"].apply(lambda x: 100*np.e**x)
scores
logprob %
par -1.610214 19.984480
Par -4.235214 1.447671
PAR -4.172714 1.541038
Paris -0.828964 43.650117
what -4.422714 1.200162
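
Note that the top 5 candidates don't account for all of the probability mass; summing the percentages shows how much is spread across every other token in the vocabulary:

scores["%"].sum()  # ~67.8, so roughly a third of the probability mass falls outside the top 5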

Increase the temperature

We can increase the temperature a bit so it'll take tokens that aren't optimal, leading to a different answer: "that's an easy one - Paris". In this example, I had to increase the temperature above 1 (which is quite high) to get it to not quickly select Paris. It still went with Paris eventually, but that's a fluke of this particular question. It went with "that" as the first token, which had a logprob of about -6, far less likely than the top 4 answers above.

Note that the logprobs are really just the probability that a word follows from the preceding text; the logprob of "one" in "that's an easy one - Paris" is about -0.1, or 90%. There are very few words that would make the "that's an easy…" part make sense. So the individual logprobs don't necessarily tell us how well a sentence actually continues the original prompt.
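
Because each logprob is conditional on the text before it, the probability of an entire completion is the product of the per-token probabilities, i.e. e raised to the sum of the logprobs. A small sketch using the temperature-0 response from earlier:

df = pd.DataFrame(r["choices"][0]["logprobs"])
used = df[:df["tokens"].to_list().index("\n")]  # ignore the tokens past the stop
np.e**used["token_logprobs"].sum()  # probability of generating exactly this completion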

kwargs["temperature"] = 1.2
r = openai.Completion.create(prompt=prompt, **kwargs)
pd.DataFrame(r["choices"][0]["logprobs"])

Again, the high temperature means it doesn't always select the most probable next token; what that token would have been can be found by digging around in the top_logprobs column (see the sketch after the table).

tokens token_logprobs top_logprobs text_offset
0 that -5.997170 {' Paris': -0.8409195, ' par': -1.5284195, ' P… 35
1 's -0.899242 {' is': -1.1492424, ''s': -0.8992424, 'bytes:\… 40
2 an -3.084446 {' easy': -3.006321, ' a': -1.475071, ' an': -… 42
3 easy -1.239227 {' example': -3.5361023, ' easy': -1.2392273, … 45
4 one -0.112442 {' q': -6.128067, ' question': -2.424942, ' an… 50
5 - -4.639725 {',': -1.1084747, '.': -2.1397247, ':': -2.483… 54
6 Paris -2.285274 {' par': -3.0977745, ' it': -2.3321495, 'Paris… 55
7 ( -4.684345 {' ': -0.52809525, '.': -2.0280952, '!': -2.24… 61
8 Y -7.411562 {'or': -2.895937, 'the': -3.380312, 'correct':… 63
9 ay -1.383808 {'ahoo': -3.2275581, 'ay': -1.3838081, 'AY': -… 64
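
For instance, to see which token the model would have picked at each position at temperature 0, we can pull the highest-probability entry out of each top_logprobs dict. A small sketch:

df = pd.DataFrame(r["choices"][0]["logprobs"])
df["greedy_pick"] = df["top_logprobs"].apply(lambda d: max(d, key=d.get))  # most likely token at each step
df[["tokens", "greedy_pick"]]  # what was sampled vs. what greedy decoding would have chosen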

Rhyming Words

Setting up the prompt

We can set up a prompt to get a rhyming word and request the top 10 logprobs. Then we can look at the probabilities for each candidate; about half of them actually rhyme (we'll check this after the table below).

prompt = """These word rhyme:
red:led
dog:frog
small:tall
train:"""
kwargs["logprobs"] = 10
kwargs["max_tokens"] = 20
kwargs["temperature"] = 0

r = openai.Completion.create(prompt=prompt, **kwargs)

scores = pd.DataFrame([r["choices"][0]["logprobs"]["top_logprobs"][0]]).T
scores.columns = ["logprob"]
scores["%"] = scores["logprob"].apply(lambda x: 100*np.e**x)
scores.sort_values(by="%", ascending=False)
logprob %
pain -1.277435 27.875130
rain -2.277435 10.254687
brain -2.621185 7.271662
chain -3.277435 3.772489
str -3.355560 3.488982
plane -3.621185 2.675095
gain -3.746185 2.360763
main -3.933685 1.957141
p -4.027435 1.781997
plant -4.089935 1.674032
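
As a rough check on the "about half rhyme" claim, we can count the candidates ending in "ain" (a crude heuristic, and note that the tokens come back with a leading space):

rhymes = scores[scores.index.str.strip().str.endswith("ain")]
len(rhymes), len(scores)  # 6 of the 10 candidates end in "ain"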

Selecting Best Match

Right now, it's pretty common to take the average logprob of a sentence to choose the best completion. It's how the best_of parameter works (only available through the programmatic interface, not through Playground at the moment), but here we'll see why it works.
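
As a sketch of the scoring itself (mean_logprob is a hypothetical helper, not part of the API), assuming the completion ends at a newline as in the examples below:

def mean_logprob(choice, stop="\n"):
    # average token logprob of a completion, ignoring everything past the stop token
    df = pd.DataFrame(choice["logprobs"])
    df = df[:df["tokens"].to_list().index(stop)]
    return df["token_logprobs"].mean()

mean_logprob(r["choices"][0])  # closer to 0 means a more confident completion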

Testing Avg Logprobs at Different Temperatures

We can start by building a general prompt and checking its results at different temperatures. We'll see if we can generate a sentence that rhymes.

prompt = """These pairs of sentences rhyme:
My favorite color is red
ends with: "red"
"red" rhymes with "bed"
Rhyme: It's the color of my bed
-----
I once had a dog
ends with: "dog"
"dog" rhymes with "frog"
Rhyme: That good boy ate a frog
-----
I wish I was small
ends with: "small"
"small" rhymes with "tall"
Rhyme: Instead I'm so tall ='(
-----
That's a cool train
ends with:"""
kwargs["logprobs"] = 5
kwargs["max_tokens"] = 40
kwargs["temperature"] = 0
kwargs["stop"] = "-----"

Temp 0

So at temp 0, trying to rhyme with "That's a cool train", we get "I like to ride the rain".

We can look at the probabilities for these tokens.

r = openai.Completion.create(prompt=prompt, **kwargs)
rhymed = pd.DataFrame(r["choices"][0]["logprobs"])[18:]  # skip the first 18 tokens of 'ends with'/'rhymes with' scaffolding; save results to check later
rhymed
tokens token_logprobs top_logprobs text_offset
18 I -1.807033 {' That': -1.9320335, ' I': -1.8070335, ' The'… 407
19 like -1.590843 {''m': -2.6533432, ' love': -2.5908432, ' wish… 409
20 to -1.032188 {' trains': -2.5634384, ' the': -2.0946884, ' … 414
21 ride -1.397640 {' watch': -1.5538902, ' hear': -3.6476402, ' … 417
22 the -1.274738 {' a': -2.3059883, ' the': -1.2747383, ' in': … 422
23 rain -0.667145 {' Rain': -5.68277, ' subway': -4.823395, ' "'… 426
24 \n -0.295723 {'.': -3.389473, ' ': -0.29572296, ' train': -… 431
25 - -0.310501 {' ': -2.529251, '-': -0.3105011, 'R': -4…. 432
26 \n -0.019367 {' ': -0.019367218, ' ': -6.050617, ' I': -8.0… 432
27 I -1.208935 {' ': -2.8651848, 'That': -3.0839348, 'My': -2… 432
28 like -1.733200 {''m': -2.51445, ' love': -2.70195, ' have': -… 432
29 to -0.730225 {' the': -3.2614746, ' to': -0.7302246, ' that… 432
30 eat -1.868992 {' read': -2.9002419, ' sing': -3.4002419, ' e… 432
31 \n -2.445057 {' pizza': -3.257557, ' ': -2.445057, ' pie': … 432

Temp .5

We can rerun this at temperature = .5 by just changing the kwargs. Since temperature introduces randomness, we'll try a couple of times.

kwargs["temperature"] = .5
r = openai.Completion.create(prompt=prompt, **kwargs)
df = pd.DataFrame(r["choices"][0]["logprobs"])[18:]
rhyming_pt5 = df.copy() # save for analysis

r["choices"][0]["text"]

The first attempt generates "It's not a plane", which rhymes! Let's try again:

kwargs["temperature"] = .5
r = openai.Completion.create(prompt=prompt, **kwargs)
df = pd.DataFrame(r["choices"][0]["logprobs"])[18:]
bad_pt5 = df.copy() # save for analysis

r["choices"][0]["text"]

This generates 'The rain is so cool', which unfortunately does not rhyme; it put 'rain' in the wrong spot.

So now we can look at the average logprobs and see which completions scored highest. Sure enough, the results that actually rhymed had higher average logprobs than the one that didn't.

>> rhymed[:rhymed.tokens.to_list().index("\n")].token_logprobs.mean() # get the tokens until the newline character, then take their mean logprob
-1.2949314

>> rhyming_pt5[:rhyming_pt5.tokens.to_list().index("\n")].token_logprobs.mean()
-1.5994041460000001

>> bad_pt5[:bad_pt5.tokens.to_list().index("\n")].token_logprobs.mean()
-1.7326385720000002

best_of

So this is where best_of comes in. We can run this 10 times using n=10, which is what best_of does under the hood (it then selects the highest-logprob result).

kwargs["n"] = 10
r = openai.Completion.create(prompt=prompt, **kwargs)

Then we just pull each of the 10 completions from the choices part of the returned JSON and measure their mean logprobs.

texts = [r["choices"][i]["text"].split("\n")[-2][7:] for i in range(10)]
logprobs = []
for i in range(10):
  df = pd.DataFrame(r["choices"][i]["logprobs"])[18:]
  df["actual_top_logprob"] = df.top_logprobs.apply(lambda x: getTopValueFromDict(x))
  logprobs.append(df[:df.tokens.to_list().index("\n")].token_logprobs.mean())

df = pd.DataFrame([texts]).T
df.columns = ["text"]
df["logprob"] = logprobs
df["%"] = df.logprob.apply(lambda x: 100*np.e**x)
df.sort_values(by="logprob", ascending=False)

We can see that while the highest probability result rhymes, it's hit or miss on the rest.
text logprob %
5 I like the rain -1.092228 33.546824
9 That's a cool raincoat -1.343981 26.080522
6 It's really fun to ride -1.435801 23.792467
2 It goes "Chugga chugga chugga" -1.829431 16.050481
7 I can't find my brain -1.907298 14.848097
0 It's a very long train -3.161690 4.235411
8 That's the brain train -3.285344 3.742771
3 It's made of tin and rain -3.425831 3.252225
4 I'm getting wet again -4.019777 1.795697
1 The strain of that train -4.875232 0.763333
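
With the scored dataframe in hand, picking what best_of would return is one line:

df.sort_values(by="logprob", ascending=False).iloc[0]["text"]  # 'I like the rain'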

That covers how best_of works, as well as some tweaks you can play with for your own use case.
