Code: https://gist.github.com/brockmanmatt/7a346d641e2d2159eb3319f888193212
Intro
The logprobs that the API returns aren't that complicated, but they can be a bit intimidating. This section breaks down how to access the logprobs and how to use them.
meaning
The logprob is the log of the probability that a token comes next. In computer science, multiplying is computationally expensive and adding is cheap, so a lot of the time when you have to multiply probabilities you take the logs and add them instead to get the same result. To convert a logprob back to the original probability, you just take e^logprob, which in Python is np.e**logprob (using import numpy as np).
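As a quick sanity check (using only numpy, no API call, and made-up probabilities), adding logs really does match multiplying the probabilities:

```python
import numpy as np

# two example token probabilities (illustrative values, not API output)
p1, p2 = 0.4365, 0.90

# multiplying probabilities is equivalent to adding their logprobs
logprob1, logprob2 = np.log(p1), np.log(p2)
joint_from_logs = np.e ** (logprob1 + logprob2)

print(joint_from_logs)                        # same value as p1 * p2
print(np.isclose(joint_from_logs, p1 * p2))   # True
```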
setup
We'll start by setting up some basic parameters and imports.
import openai, json, pandas as pd, numpy as np
openai.api_key = "YOUR API KEY GOES HERE"
#arguments to send the API
kwargs = { "engine":"davinci", "temperature":0, "max_tokens":10, "stop":"\n" }
Question/Answering Example
Temp 0
We can start by seeing what happens when we ask the API for the capital of France.
prompt = """q: what is the capital of France
a:"""
r = openai.Completion.create(prompt=prompt, **kwargs)
r["choices"][0]["text"]
out: 'Paris'
Check Logprobs
We can view the logprobs that come back by asking for them: we'll set the API to return the top 5, redo the request, and display the results in a pandas dataframe.
kwargs["logprobs"] = 5
r = openai.Completion.create(prompt=prompt, **kwargs)
pd.DataFrame(r["choices"][0]["logprobs"])
This gives us the following table:
| tokens | token_logprobs | top_logprobs | text_offset |
0 | Paris | -0.828964 | {' par': -1.6102142, ' Par': -4.235214, ' PAR'… | 35 |
1 | \n | -0.364414 | {',': -3.1456642, '.': -2.6144142, ' ': -0.364… | 41 |
2 | q | -1.213570 | {' ': -1.5885696, 'The': -4.2291946, 'b': -2.4… | 41 |
3 | : | -0.004189 | {' :': -7.0354385, '.': -7.0354385, '1': -8.53… | 41 |
4 | what | -0.479179 | {' What': -2.2916794, ' who': -3.4791794, ' wh… | 41 |
5 | is | -0.297340 | {' country': -4.4223404, ' color': -4.0473404,… | 41 |
6 | the | -0.146500 | {' a': -4.0527496, ' the': -0.14649963, ' 1': … | 41 |
7 | capital | -0.774006 | {' name': -3.586506, ' color': -3.867756, ' ca… | 41 |
The index is just the nth token generated. We can see that even though the stop token ('\n', newline) was generated 2nd, the completion actually kept going for a bit past the stop. The logprob of the token it actually selected is in the second column. Then under the top_logprobs column we have the 5 most likely candidates at each position; since temperature is 0, the selected token is the one whose logprob is highest (closest to zero).
Taking a look at the top logprobs for index 0 (the answer, Paris), we can see that it was the highest-probability option at -0.83, or about 44%; since logprobs are all negative, the ones closest to 0 are the most likely. We can also see one of the effects of OpenAI's encoding, where there are tokens for 'par', 'Par', and 'PAR', in addition to the one it decided on, 'Paris'.
scores = pd.DataFrame([r["choices"][0]["logprobs"]["top_logprobs"][0]]).T
scores.columns = ["logprob"]
scores["%"] = scores["logprob"].apply(lambda x: 100*np.e**x)
scores
| token | logprob | % |
par | -1.610214 | 19.984480 |
Par | -4.235214 | 1.447671 |
PAR | -4.172714 | 1.541038 |
Paris | -0.828964 | 43.650117 |
what | -4.422714 | 1.200162 |
Increase the temperature
We can increase the temperature a bit so it will sometimes take tokens that aren't optimal, leading to a different answer: "that's an easy one - Paris". In this example, I had to increase the temperature above 1 (which is quite high) to get it to not immediately select Paris. It still went with Paris eventually, though that's a fluke of this question choice. It went with "that" as the first token, which had a logprob of roughly -6, far less likely than the top answers above.
Note that the logprobs are really just the probability that a token follows from the preceding text; the logprob of "one" in "that's an easy one - Paris" is -0.1, or about 90%. There are very few words that would make the "that's an easy…" part make sense. So the individual logprobs don't necessarily tell us how well a sentence actually continues the original prompt.
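This also means the joint probability of a whole completion is just e raised to the sum of its token logprobs, and the per-token average is what we'll use later for ranking. A sketch with the approximate logprobs of the completion discussed here (values rounded, for illustration only):

```python
import numpy as np

# approximate token logprobs for: that 's an easy one - Paris
token_logprobs = [-5.997, -0.899, -3.084, -1.239, -0.112, -4.640, -2.285]

# summing logprobs gives the log of the product of the probabilities
joint_prob = np.e ** sum(token_logprobs)
avg_logprob = np.mean(token_logprobs)  # per-token average, used later for ranking

print(f"joint probability: {joint_prob:.2e}")  # tiny, mostly due to the unlikely first token
print(f"average logprob: {avg_logprob:.3f}")
```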
kwargs["temperature"] = 1.2
r = openai.Completion.create(prompt=prompt, **kwargs)
pd.DataFrame(r["choices"][0]["logprobs"])
Again, the higher temperature means it doesn't always select the most probable next token, which can be found by digging around in the top_logprobs column.
tokens | token_logprobs | top_logprobs | text_offset | |
0 | that | -5.997170 | {' Paris': -0.8409195, ' par': -1.5284195, ' P… | 35 |
1 | 's | -0.899242 | {' is': -1.1492424, ''s': -0.8992424, 'bytes:\… | 40 |
2 | an | -3.084446 | {' easy': -3.006321, ' a': -1.475071, ' an': -… | 42 |
3 | easy | -1.239227 | {' example': -3.5361023, ' easy': -1.2392273, … | 45 |
4 | one | -0.112442 | {' q': -6.128067, ' question': -2.424942, ' an… | 50 |
5 | - | -4.639725 | {',': -1.1084747, '.': -2.1397247, ':': -2.483… | 54 |
6 | Paris | -2.285274 | {' par': -3.0977745, ' it': -2.3321495, 'Paris… | 55 |
7 | ( | -4.684345 | {' ': -0.52809525, '.': -2.0280952, '!': -2.24… | 61 |
8 | Y | -7.411562 | {'or': -2.895937, 'the': -3.380312, 'correct':… | 63 |
9 | ay | -1.383808 | {'ahoo': -3.2275581, 'ay': -1.3838081, 'AY': -… | 64 |
Rhyming Words
Setting up the prompt
We can set up a prompt to get a rhyming word and ask for the top 10 logprobs, then look at the probabilities for them. About half of the candidates actually rhyme.
prompt = """These word rhyme:
red:led
dog:frog
small:tall
train:"""
kwargs["logprobs"] = 10
kwargs["max_tokens"] = 20
kwargs["temperature"] = 0
r = openai.Completion.create(prompt=prompt, **kwargs)
scores = pd.DataFrame([r["choices"][0]["logprobs"]["top_logprobs"][0]]).T
scores.columns = ["logprob"]
scores["%"] = scores["logprob"].apply(lambda x: 100*np.e**x)
scores.sort_values(by="%", ascending=False)
| token | logprob | % |
pain | -1.277435 | 27.875130 |
rain | -2.277435 | 10.254687 |
brain | -2.621185 | 7.271662 |
chain | -3.277435 | 3.772489 |
str | -3.355560 | 3.488982 |
plane | -3.621185 | 2.675095 |
gain | -3.746185 | 2.360763 |
main | -3.933685 | 1.957141 |
p | -4.027435 | 1.781997 |
plant | -4.089935 | 1.674032 |
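Since the candidates come back as a plain dict of token to logprob, it's easy to filter for the ones that actually rhyme. A rough sketch using the values from the table above, naively treating anything ending in "ain" as a rhyme:

```python
import numpy as np

# top-10 candidates and logprobs from the rhyming prompt above
top_logprobs = {
    " pain": -1.277435, " rain": -2.277435, " brain": -2.621185,
    " chain": -3.277435, " str": -3.355560, " plane": -3.621185,
    " gain": -3.746185, " main": -3.933685, " p": -4.027435,
    " plant": -4.089935,
}

# crude rhyme check: does the candidate end in "ain"?
rhymes = {tok.strip(): 100 * np.e ** lp
          for tok, lp in top_logprobs.items() if tok.strip().endswith("ain")}

print(rhymes)  # pain, rain, brain, chain, gain, main
print(f"total probability on rhymes: {sum(rhymes.values()):.1f}%")  # roughly half
```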
Selecting Best Match
Right now, it's pretty common to take the average logprob of a completion to choose the best one. That's how the best_of parameter works (only available through the API, not through the Playground at the moment); here we'll see why it works.
Testing Avg Logprobs at Different Temperatures
We can start by building a general prompt whose results we'll check at different temperatures. We'll see if we can generate a sentence that rhymes.
prompt = """These pairs of sentences rhyme:
My favorite color is red
ends with: "red"
"red" rhymes with "bed"
Rhyme: It's the color of my bed
-----
I once had a dog
ends with: "dog"
"dog" rhymes with "frog"
Rhyme: That good boy ate a frog
-----
I wish I was small
ends with: "small"
"small" rhymes with "tall"
Rhyme: Instead I'm so tall ='(
-----
That's a cool train
ends with:"""
kwargs["logprobs"] = 5
kwargs["max_tokens"] = 40
kwargs["temperature"] = 0
kwargs["stop"] = "-----"
Temp 0
So at temp 0, trying to rhyme with "That's a cool train", we get "I like to ride the rain".
We can look at the probabilities for these tokens
r = openai.Completion.create(prompt=prompt, **kwargs)
pd.DataFrame(r["choices"][0]["logprobs"])[18:]
rhymed = pd.DataFrame(r["choices"][0]["logprobs"])[18:] # save results to check later
tokens | token_logprobs | top_logprobs | text_offset | |
18 | I | -1.807033 | {' That': -1.9320335, ' I': -1.8070335, ' The'… | 407 |
19 | like | -1.590843 | {''m': -2.6533432, ' love': -2.5908432, ' wish… | 409 |
20 | to | -1.032188 | {' trains': -2.5634384, ' the': -2.0946884, ' … | 414 |
21 | ride | -1.397640 | {' watch': -1.5538902, ' hear': -3.6476402, ' … | 417 |
22 | the | -1.274738 | {' a': -2.3059883, ' the': -1.2747383, ' in': … | 422 |
23 | rain | -0.667145 | {' Rain': -5.68277, ' subway': -4.823395, ' "'… | 426 |
24 | \n | -0.295723 | {'.': -3.389473, ' ': -0.29572296, ' train': -… | 431 |
25 | - | -0.310501 | {' ': -2.529251, '-': -0.3105011, 'R': -4…. | 432 |
26 | \n | -0.019367 | {' ': -0.019367218, ' ': -6.050617, ' I': -8.0… | 432 |
27 | I | -1.208935 | {' ': -2.8651848, 'That': -3.0839348, 'My': -2… | 432 |
28 | like | -1.733200 | {''m': -2.51445, ' love': -2.70195, ' have': -… | 432 |
29 | to | -0.730225 | {' the': -3.2614746, ' to': -0.7302246, ' that… | 432 |
30 | eat | -1.868992 | {' read': -2.9002419, ' sing': -3.4002419, ' e… | 432 |
31 | \n | -2.445057 | {' pizza': -3.257557, ' ': -2.445057, ' pie': … | 432 |
Temp .5
We can rerun this at temperature = .5 by just changing the kwargs. Since temperature introduces randomness, we'll try a couple of times.
kwargs["temperature"] = .5
r = openai.Completion.create(prompt=prompt, **kwargs)
df = pd.DataFrame(r["choices"][0]["logprobs"])[18:]
rhyming_pt5 = df.copy() # save for analysis
r["choices"][0]["text"]
The first attempt generates "It's not a plane", which rhymes!
kwargs["temperature"] = .5
r = openai.Completion.create(prompt=prompt, **kwargs)
df = pd.DataFrame(r["choices"][0]["logprobs"])[18:]
bad_pt5 = df.copy() # save for analysis
r["choices"][0]["text"]
This generates 'The rain is so cool', which unfortunately does not rhyme; it put 'rain' in the wrong spot.
So now we can look at the average logprobs and see which logprobs were highest. Sure enough, the results that actually rhymed had higher average logprobs than the one that didn't.
>> rhymed[:rhymed.tokens.to_list().index("\n")].token_logprobs.mean() #Get the tokens until the newLine character, then take their mean logprob
-1.2949314
>> rhyming_pt5[:rhyming_pt5.tokens.to_list().index("\n")].token_logprobs.mean()
-1.5994041460000001
>> bad_pt5[:bad_pt5.tokens.to_list().index("\n")].token_logprobs.mean()
-1.7326385720000002
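This truncate-at-the-stop-token-then-average pattern comes up often enough that it's worth a small helper. A sketch (the function name is my own, not part of the API), shown on a toy dataframe in the same shape as the API's logprobs:

```python
import pandas as pd

def mean_logprob_until_stop(df, stop_token="\n"):
    """Average token_logprobs up to (not including) the first stop token."""
    tokens = df.tokens.to_list()
    cutoff = tokens.index(stop_token) if stop_token in tokens else len(tokens)
    return df[:cutoff].token_logprobs.mean()

# toy example mimicking the API's logprobs dataframe
toy = pd.DataFrame({"tokens": [" I", " like", " rain", "\n", " extra"],
                    "token_logprobs": [-1.0, -2.0, -0.5, -0.3, -4.0]})
print(mean_logprob_until_stop(toy))  # mean of the first three: -1.1666...
```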
best_of
So this is where best_of comes in. We can run this 10 times using n=10, which is what best_of does (it then selects the highest-logprob result).
kwargs["n"] = 10
r = openai.Completion.create(prompt=prompt, **kwargs)
Then we just pull each of the 10 completions, which are in the choices part of the returned JSON, and measure their mean logprobs.
texts = [r["choices"][i]["text"].split("\n")[-2][7:] for i in range(10)]
logprobs = []
for i in range(10):
df = pd.DataFrame(r["choices"][i]["logprobs"])[18:]
df["actual_top_logprob"] = df.top_logprobs.apply(lambda x: max(x.values())) # highest logprob among the top candidates at each position
logprobs.append(df[:df.tokens.to_list().index("\n")].token_logprobs.mean())
df = pd.DataFrame([texts]).T
df.columns=["text"]
df["logprob"] = logprobs
df["%"] = df.logprob.apply(lambda x: 100*np.e**x)
text | logprob | % | |
5 | I like the rain | -1.092228 | 33.546824 |
9 | That's a cool raincoat | -1.343981 | 26.080522 |
6 | It's really fun to ride | -1.435801 | 23.792467 |
2 | It goes "Chugga chugga chugga" | -1.829431 | 16.050481 |
7 | I can't find my brain | -1.907298 | 14.848097 |
0 | It's a very long train | -3.161690 | 4.235411 |
8 | That's the brain train | -3.285344 | 3.742771 |
3 | It's made of tin and rain | -3.425831 | 3.252225 |
4 | I'm getting wet again | -4.019777 | 1.795697 |
1 | The strain of that train | -4.875232 | 0.763333 |
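Picking the best_of winner is then just sorting by average logprob and taking the top row. A minimal sketch, run here on a toy two-candidate frame (values taken from the table above) so it stands alone:

```python
import pandas as pd

# toy stand-in for the dataframe of candidate texts and mean logprobs
df = pd.DataFrame({"text": ["I like the rain", "The strain of that train"],
                   "logprob": [-1.092228, -4.875232]})

# best_of effectively keeps the candidate with the highest mean logprob
best = df.sort_values(by="logprob", ascending=False).iloc[0]
print(best["text"])  # I like the rain
```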
That covers how best_of works, as well as some tweaks you can play with for your own use case.