Intro
Since the API just predicts what comes next, providing prompts lets you try to get the API to fill in responses based on the pattern you give it. However, there are only 2048 total tokens available between the prompt and the response, so even if you're only requesting a single token at a time, that leaves at most 2047 tokens to influence the output. The process of getting the best examples into the context is called context stuffing. A lot of people have been asking about this sort of thing, so hopefully this helps.
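Before stuffing anything, it helps to know how many tokens a candidate prompt actually uses. Here's a minimal budgeting sketch, assuming the GPT-2 tokenizer from Hugging Face's transformers library as a stand-in for the API's tokenizer (the counts should be close but aren't guaranteed to match exactly):
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def tokens_left(prompt, max_tokens=150, context_size=2048):
    # estimate how much context room remains after the prompt and the
    # requested completion length are accounted for
    return context_size - len(tokenizer.encode(prompt)) - max_tokens

print(tokens_left("Dog\nAnimal\n\nJaguar\n"))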
The example in this post shows how to select previously labeled headlines with similar content to help provide context for headline meaning. It's a bit trivial in this example because the headlines are all from the same day, but the point is to show how to do this sort of thing. The code for this is at https://gist.github.com/brockmanmatt/d9b785d141755ac0c687404f2a3e6a78
Here, the code uses a simple query method that's built with the following snippet:
!pip install openai
import openai, json, datetime, pandas as pd

openai.api_key = "YOUR_API_KEY"  # set your own key here

kwargs = {"engine": "davinci", "temperature": 0, "max_tokens": 150, "stop": "\n\n"}

def query(prompt, myKwargs=kwargs, full=False):
    """
    Wrapper for the API that saves the prompt and the result.
    prompt - text the completion will try to complete
    myKwargs - keyword arguments for the completion API
    full - whether to return the full JSON response or just the text (generally just need the text)
    """
    r = openai.Completion.create(prompt=prompt, **myKwargs)
    if not full:  # if we don't need the full response, just return the text
        r = r["choices"][0]["text"].strip()
    # log each call to a timestamped JSON file
    with open("{}.json".format(datetime.datetime.now().strftime("%Y%m%d%H%M%S%f")), "w") as fh:
        json.dump({"prompt": prompt, "response": r}, fh, indent=4)
    return r
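With that in place, a call is just query(prompt). A quick (hypothetical) smoke test:
# returns the completion text and logs the call to a timestamped .json file
query("Dog\nAnimal\n\nCar\n")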
Methods for Context Stuffing
There hasn't been much written on how to optimize context selection, but here's an overview.
Let's say we want to do a few-shot example for labeling. In a simple case, we want to know what sort of thing an item is. The types of examples in the context will influence the label. For instance, a jaguar can be a car or an animal. If we give it a dog/animal example, here it'll label a jaguar as an animal…
prompt = """Dog\nAnimal\n\nCar\nMachine\n\nJaguar\n"""
query(prompt)
'Animal'
On the other hand, if we give it an example without the dog/animal, we get that it's a car.
prompt = """Car\nMachine\n\nApple\nComputer\n\nJaguar\n"""
query(prompt)
'Car'
So what we want is a way to select examples, within the limited prompt space, that will guide the model toward the task we're actually trying to complete.
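As a rough sketch of that idea (this helper is hypothetical, not from the gist, and the character budget is a crude stand-in for real token counting at roughly 4 characters per token):
def stuff_context(examples, target, budget_chars=6000):
    # examples: (headline, label) pairs, assumed sorted most relevant first;
    # pack them into a few-shot prompt until a rough character budget is hit
    prompt = ""
    for text, label in examples:
        block = "Headline: {}\nLabel: {}\n\n".format(text, label)
        if len(prompt) + len(block) > budget_chars:
            break
        prompt += block
    return prompt + "Headline: {}\nLabel:".format(target)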
Examples: Selecting Similar Headlines
To start, we need a dataset. I'll quickly grab some news headlines and use the API to label 20 random ones with their topics.
df = pd.read_csv("https://raw.githubusercontent.com/brockmanmatt/CoverageTrends/master/archived_links/newyorktimes/202007/newyorktimes_20200718.csv")
df = pd.DataFrame(df.text.unique(), columns=["text"])  # get unique headlines
batch1 = df.sample(20)  # get 20 random headlines

prompt = """Label what each article headline is about

Headline: Today it rained in Idaho
Label: Weather

Headline: {}
Label:"""

batch1_labels = []
for row in batch1.iterrows():
    batch1_labels.append(query(prompt.format(row[1]["text"].strip())))
batch1["label"] = batch1_labels
Now we have 20 labeled headlines!
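To sanity-check, peek at what came back (the labels will vary from run to run):
batch1[["text", "label"]].head()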
Search API
Since we're using the OpenAI API in the first place, we can use the Search API to help provide context to the model.
For instance, here I take a bunch of headlines and try to label them. By selecting similar headlines for the context, I can give the model more information to guess what each article is about.
I pull 20 new articles and find their most similar matches among the articles that are already labeled.
batch2 = df[~df.text.isin(batch1.text)].sample(20)
batch2_sims = []
labeled_doc_headlines = batch1.text.to_list()
for row in batch2.iterrows():
    # get similarity scores against every labeled headline
    scores = openai.Engine("davinci").search(documents=labeled_doc_headlines, query=row[1]["text"])
    batch2_sims.append([(scores["data"][i]["score"], labeled_doc_headlines[i]) for i in range(len(labeled_doc_headlines))])
batch2["sims"] = batch2_sims
Now that I've got the similarity scores, I'll stuff the context with the labeled examples. (Then I could also go back over the originals to make the labels more consistent, which is useful for more complex prompts.)
labeled = batch1.copy()
labeled.set_index("text", drop=True, inplace=True)  # set the index to the headlines

batch2_labels = []
for row in batch2.iterrows():
    prompt = ""
    for sim in sorted(row[1]["sims"])[-3:]:  # add the 3 most similar headlines with their labels to the prompt
        prompt += """Headline: {}\nLabel: {}\n\n""".format(sim[1], labeled.at[sim[1], "label"])
    prompt += "Headline: {}\n".format(row[1]["text"])
    prompt += "Label:"
    batch2_labels.append(query(prompt))
batch2["label"] = batch2_labels
And now the new articles are labeled, using the most similar headlines from the previous run as the few-shot examples to show the model what the output should look like!
Using completion labels
Instead of using the search API, I can use the completion API to label the previous articles and select articles with similar labels.
Here I'll add a second field, "entities", that hopefully helps provide context so that later on, by seeing different headlines mentioning the same topic, the model can figure out when an entity is the same. For this new prompt, 1-shot doesn't work for getting both entities + label; 2-shot does.
batch3 = df[~df.text.isin(batch1.text.to_list()+batch2.text.to_list())].sample(20)  # my new initial label set
batch4 = df[~df.text.isin(batch1.text.to_list()+batch2.text.to_list()+batch3.text.to_list())].sample(20)  # my set to label

prompt = """Label each article with the mentioned entities and a label for the general topic

Headline: Today it rained in Idaho
Entities: Idaho
Label: Weather

Headline: A fireman saved a cat from a tree in the Bronx
Entities: Fireman, Cat, Bronx
Label: Local News

Headline: {}
Entities:"""

batch3_labels = []
for row in batch3.iterrows():  # go through and get the API's completion for each headline
    batch3_labels.append(query(prompt.format(row[1]["text"].strip())))

ents = []
labels = []
for completion in batch3_labels:  # split each completion into entities and label to add to the dataframe
    ent, label = completion.split("\nLabel:")
    ents.append(ent.strip())
    labels.append(label.strip())
batch3["entities"] = ents
batch3["label"] = labels
Then we can go ahead and give batch 4 initial entities and labels the same way as batch 3.
batch4_labels = []
for row in batch4.iterrows():
    batch4_labels.append(query(prompt.format(row[1]["text"].strip())))

ents4 = []
labels4 = []
for completion in batch4_labels:
    ent, label = completion.split("\nLabel:")
    ents4.append(ent.strip())
    labels4.append(label.strip())
batch4["entities"] = ents4
batch4["label"] = labels4
And now we can use batch 4's initial entity labels to select matching examples from batch 3 and stuff the context with potentially more relevant information.
newLabels = []
for row in batch4.iterrows():
    prompt = ""
    labelMatches = batch3[batch3.entities == row[1]["entities"]]
    if len(labelMatches) < 3:  # pad with random extra samples if there aren't enough matches
        randomExtra = batch3[~batch3.text.isin(labelMatches.text.to_list())].sample(3-len(labelMatches))
        labelMatches = pd.concat([labelMatches, randomExtra], axis=0)
    for example in labelMatches.sample(3).iterrows():  # add 3 selected headlines with their entities and labels to the prompt
        prompt += """Headline: {}\nEntities: {}\nLabel: {}\n\n""".format(example[1]["text"], example[1]["entities"], example[1]["label"])
    prompt += "Headline: {}\n".format(row[1]["text"])
    prompt += "Entities:"
    newLabels.append(query(prompt))
And so forth. Notably, performance decreases here on the second sweep because the completion sometimes loses the label (although I didn't clean the text). When that happens, we just fall back to a blank string for the label and it's fine.
ents4 = []
labels4 = []
for completion in newLabels:
    try:
        ent, label = completion.split("\nLabel:")
    except ValueError:  # the completion dropped the label; fall back to a blank string
        ent = completion
        label = ""
    ents4.append(ent.strip())
    labels4.append(label.strip())
batch4["entities"] = ents4
batch4["label"] = labels4
Using non-API selection methods
You can also just use TF-IDF to select articles with similar words. It can perform pretty well, getting competitive results on NLP benchmarks (it was used for https://leaderboard.allenai.org/break_high_level/submissions/public).
First I fit a TF-IDF vectorizer from sklearn to do the comparison.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
vectorizer = TfidfVectorizer(max_features = 5000)
vectorizer.fit(batch1.text)
docs = vectorizer.transform(batch1.text)
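Since TfidfVectorizer L2-normalizes its vectors by default, linear_kernel on them is equivalent to cosine similarity. A quick sanity check (a headline should be most similar to itself, with a score of about 1.0):
sims = linear_kernel(vectorizer.transform([batch1.text.iloc[0]]), docs).flatten()
print(batch1.text.iloc[sims.argmax()], sims.max())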
Then I use the headlines with similar TF-IDF vectors to stuff my prompt!
batch2_labels = []
labeled = batch1.copy()
labeled.set_index("text", drop=True, inplace=True)  # set the index to the headlines
for row in batch2.iterrows():
    # get the most similar labeled headlines
    tmp = vectorizer.transform([row[1]["text"]])
    sims = linear_kernel(tmp, docs).flatten()  # get similarities to the labeled corpus
    idxs = sims.argsort()[-3:]  # get the 3 most similar
    myExamples = batch1.iloc[idxs]
    prompt = ""
    for example in myExamples.iterrows():  # add the 3 most similar headlines with their labels to the prompt
        prompt += """Headline: {}\nLabel: {}\n\n""".format(example[1]["text"], example[1]["label"])
    prompt += "Headline: {}\n".format(row[1]["text"])
    prompt += "Label:"
    batch2_labels.append(query(prompt))
batch2["label"] = batch2_labels