Word in Context

parent:
linguistics

Overview

The Word in Context (WiC) benchmark is part of the SuperGLUE benchmark suite. In the GPT-3 paper, the OpenAI team got 49%, essentially random chance, on the dev set (85% on the test set is SOTA). I submitted a performance eval on the actual test set using multiple outputs chained together from completion and search to get 67%, but you may be able to get similar or better performance just by doing informed few-shot prompting that has the model do its own context-stuffing as a step of producing the output. Additional work in this area may improve the ability to few-shot disambiguation tasks. Given the API's implicit understanding of language, clever people should be able to improve these results quickly.

In general, the WiC task looks like this (from the training set):

target | context 1 | context 2 | noun/verb | same sense
admit | To admit a serious thought into the mind . | She admitted us here . | V | T

The task is to take the target word (admit in this case) and see whether it has the same sense in context-1 and context-2. In this case, both do because it's being used in the sense of "to be allowed in".

On the other hand, given the following:

target | context 1 | context 2 | noun/verb | same sense
write | How many books did Georges Simenon write ? | Please write to me every week . | V | F

we see that 'write' in the first sentence means to author or publish, while 'write' in the second means to correspond, a different sense.
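The task structure and the label convention can be sketched as a small data class; the names here are illustrative (my own, not from the official WiC tooling):

```python
from dataclasses import dataclass

@dataclass
class WiCExample:
    target: str    # the word being disambiguated
    pos: str       # "N" or "V"
    context1: str
    context2: str
    label: str     # "T" = same sense, "F" = different sense

def same_sense(example):
    """Return True when the gold label says the target keeps its sense."""
    return example.label == "T"

admit = WiCExample("admit", "V", "To admit a serious thought into the mind .",
                   "She admitted us here .", "T")
write = WiCExample("write", "V", "How many books did Georges Simenon write ?",
                   "Please write to me every week .", "F")
```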

Simple Attempts: Structure is needed

The code for this section is also at https://gist.github.com/brockmanmatt/aab3796febe177705806875817660722.

We can begin by setting up the basics which we'll use throughout; we import the API and some libraries, set up the key, and build a quick wrapper.

import openai, json, pandas as pd, numpy as np, random
#keywords for API parameters
kwargs = {"engine":"davinci", "temperature":0, "max_tokens":100, "stop":"\n"}
openai.api_key = "YOUR API KEY GOES HERE"

def query(prompt, myKwargs=kwargs):
    """Simple wrapper for the completion API; returns the stripped text of the first choice."""
    return openai.Completion.create(prompt=prompt, **myKwargs)["choices"][0]["text"].strip()

Then we need to go load in the WiC datasets. We'll load in the train set first.

!wget https://pilehvar.github.io/wic/package/WiC_dataset.zip

import zipfile
with zipfile.ZipFile("WiC_dataset.zip","r") as zip_ref:
    zip_ref.extractall(".")

train = pd.read_csv("train/train.data.txt", sep='\t', header=None)
train.columns = ["target", "pos", "position", "context-1", "context-2"]
train_gold = pd.read_csv("train/train.gold.txt", sep='\t', header=None)
train_gold.columns = ["label"]
train = pd.concat([train_gold,train], axis=1)

train.head()
This gives us the following table:
idx label target pos position context-1 context-2
0 F carry V 2-1 You must carry your camping gear . Sound carries well over water .
1 F go V 2-6 Messages must go through diplomatic channels . Do you think the sofa will go through the door ?
2 F break V 0-2 Break an alibi . The wholesaler broke the container loads into palettes and boxes for local retailers .
3 T cup N 8-4 He wore a jock strap with a metal cup . Bees filled the waxen cups with honey .
4 F academy N 1-2 The Academy of Music . The French Academy .
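One column worth noting is position: it appears to hold the zero-based token indices of the target word in each whitespace-tokenized context, separated by a hyphen (my reading of the format, not from official documentation). A quick sketch pulling out the surface forms:

```python
def target_tokens(position, context1, context2):
    """Split an 'i-j' position into two indices and return the target's
    surface form in each whitespace-tokenized context."""
    i, j = (int(p) for p in position.split("-"))
    return context1.split()[i], context2.split()[j]

# First training row: target "carry", position "2-1"
print(target_tokens("2-1",
                    "You must carry your camping gear .",
                    "Sound carries well over water ."))
# → ('carry', 'carries')
```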

We can go ahead and see if we can zero-shot this; we'll ask if the target word has the same sense in each sentence.

for row in train.head().iterrows():
  prompt = "Q: True or False: '{}' has the same sense in both sentences '{}' and '{}'?\nA:".format(row[1]["target"], row[1]["context-1"], row[1]["context-2"])
  print("||{}: Target: {} Actual: {} Response: {}||".format(row[0], row[1]["target"], row[1]["label"], query(prompt)))
Unfortunately, this doesn't work.
output
0: Target: carry Actual: F Response: True.
1: Target: go Actual: F Response: True.
2: Target: break Actual: F Response: True.
3: Target: cup Actual: T Response: True.
4: Target: academy Actual: F Response: True.

In fact, there are a few problems that GPT will run into off the bat. First, if we ask it a true/false question, we aren't guaranteed that it'll know to answer T or F (that's not a big deal) or that it will understand what we mean by "sense". However, we can leverage its understanding of language by asking it for the difference between the two uses.

for row in train.head().iterrows():
  prompt = "Q: What's the difference in the sense of '{}' in the sentences '{}' and '{}'?\nA:".format(row[1]["target"], row[1]["context-1"], row[1]["context-2"])
  print("||{}: Target: {} Actual: {} Response: {}||".format(row[0], row[1]["target"], row[1]["label"], query(prompt)))
output
0: Target: carry Actual: F Response: The first sentence is a command, the second a statement.
1: Target: go Actual: F Response: The first sentence is a passive sentence, and the second is an active sentence.
2: Target: break Actual: F Response: The first sentence is a transitive verb, and the second is an intransitive verb.
3: Target: cup Actual: T Response: The first is a jock strap with a protective cup for the genitals. The second is a flower with a cup-shaped structure for holding nectar.
4: Target: academy Actual: F Response: The Academy of Music is a building in Philadelphia, Pennsylvania, that was built in 1857. The French Academy is a group of French writers and scholars.

Now, it does understand the difference in the uses of the words in many of the cases; the next step is to get it to actually express whether the sense is different or not.

At this point, we can add a zero-shot definition to help it out and ask if the meaning is different.

for row in train.head().iterrows():
  prompt = "The same word can have different meanings in different contexts.\nQ: Is there a difference in the sense of '{}' in the sentences '{}' and '{}'? Answer if they're the same or different\nA:".format(row[1]["target"], row[1]["context-1"], row[1]["context-2"])
  print("||{}: Target: {} Actual: {} Response: {}||".format(row[0], row[1]["target"], row[1]["label"], query(prompt)))
This seems to work decently on the first 5 actually!
output
0: Target: carry Actual: F Response: Yes, they're different.
1: Target: go Actual: F Response: Yes, they're different. In the first sentence, 'go' means 'be sent', while in the second sentence, it means 'be able to fit through'.
2: Target: break Actual: F Response: Yes, they're different.
3: Target: cup Actual: T Response: They're the same.
4: Target: academy Actual: F Response: Yes, they're different.

That gets all five right; now we can see if it holds up on the next 5:

for row in train[5:10].iterrows():
  prompt = "The same word can have different meanings in different contexts.\nQ: Is there a difference in the sense of '{}' in the sentences '{}' and '{}'? Answer if they're the same or different\nA:".format(row[1]["target"], row[1]["context-1"], row[1]["context-2"])
  print("||{}: Target: {} Actual: {} Response: {}||".format(row[0], row[1]["target"], row[1]["label"], query(prompt)))

output
5: Target: set Actual: F Response: Yes, they're different.
6: Target: starch Actual: T Response: Yes, there is a difference.
7: Target: take Actual: F Response: Yes, they're different.
8: Target: avoid Actual: T Response: Yes, they're different.
9: Target: clearance Actual: T Response: Yes, they're different.

Well, that was a bit of a false positive: it answers "different" for nearly everything. What we can do is use those first 5 as few-shot examples and see if that improves results; that also lets us standardize the answer to "Yes" or "No" so we can quickly evaluate performance. We can print out the prompts from the first 5 examples and use them as few-shot examples going forward.

for row in train.head().iterrows():
  prompt = "Q: Is there a difference in the sense of '{}' in the sentences '{}' and '{}'?\nA:".format(row[1]["target"], row[1]["context-1"], row[1]["context-2"])
  dummy = "No" if row[1]["label"] == "T" else "Yes"  # label T = same sense, so the answer to "is there a difference?" is No
  print("{} {}".format(prompt,dummy))

This gives us a string to which we'll add the instructions.

fewShotText = """The same word can have different meanings in different contexts.
Q: Is there a difference in the sense of 'carry' in the sentences 'You must carry your camping gear .' and 'Sound carries well over water .'?
A: Yes
Q: Is there a difference in the sense of 'go' in the sentences 'Messages must go through diplomatic channels .' and 'Do you think the sofa will go through the door ?'?
A: Yes
Q: Is there a difference in the sense of 'break' in the sentences 'Break an alibi .' and 'The wholesaler broke the container loads into palettes and boxes for local retailers .'?
A: Yes
Q: Is there a difference in the sense of 'cup' in the sentences 'He wore a jock strap with a metal cup .' and 'Bees filled the waxen cups with honey .'?
A: No
Q: Is there a difference in the sense of 'academy' in the sentences 'The Academy of Music .' and 'The French Academy .'?
A: Yes
"""

Now we can query the next 15 and see the performance!

count = 0
correct = 0
for row in train[5:20].iterrows():
  prompt = "Q: Is there a difference in the sense of '{}' in the sentences '{}' and '{}'?\nA:".format(row[1]["target"], row[1]["context-1"], row[1]["context-2"])
  r = query(fewShotText + prompt)
  print(prompt + r)
  print("Actual: {}".format(row[1]["label"]))
  # gold T (same sense) should be answered "No"; gold F should be "Yes"
  if row[1]["label"] == "T" and r == "No":
    correct += 1
  elif row[1]["label"] == "F" and r == "Yes":
    correct += 1
  count += 1
  print("Examples: {} Correct: {}".format(count, correct))

Needless to say, performance is basically chance at this point.
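The scoring logic in the loop above can be factored into a small helper (hypothetical name, but the mapping matches the loop):

```python
def score(pairs):
    """pairs: list of (gold_label, model_answer) with gold in {'T','F'} and
    answer in {'Yes','No'} for the question "is there a difference?".
    Returns accuracy: 'No' predicts T (same sense), 'Yes' predicts F."""
    correct = sum(1 for gold, ans in pairs
                  if (gold == "T" and ans == "No") or (gold == "F" and ans == "Yes"))
    return correct / len(pairs) if pairs else 0.0

print(score([("T", "No"), ("F", "Yes"), ("F", "No"), ("T", "Yes")]))  # → 0.5
```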

So while there's probably more work to be done here, it turns out this task is rather difficult. Instead, we'll take advantage of the implicit knowledge we saw earlier, when the API was providing information on the uses of the words.


Select Few Shot Multistep on Noun/Verb (64.7% dev)

Code: https://gist.github.com/brockmanmatt/140dad7aefbca69e581b0aef6021b3fa

The goal of this is to make the model work through the steps of the problem so that it has the information it needs to make a decision in front of it. We can address this problem by context-stuffing the explicit meaning of the target words; furthermore, we can include that as part of the prompt for what the API will then try to produce. Because having it output the word meanings means longer outputs, we'll raise max_tokens and let it produce multiple lines.

It handles nouns and verbs separately, although the prompts aren't too different: mainly the verb prompt uses "to X" rather than just "X" to get the root verb, whereas for a noun it's just the noun.

kwargs2 = { "engine":"davinci", "temperature":0, "max_tokens":200, "stop":"\n\n",}

def queryTwoLine(prompt, myKwargs = kwargs2):
  """
  wrapper for the API
  """
  r = openai.Completion.create(prompt=prompt, **myKwargs)["choices"][0]["text"].strip()
  return r

Now we'll start with a few-shot prompt for verbs, which we'll call fewShotVerb. This has the model think through the verb, produce an intermediate answer (the 'A:' line giving the meaning in each sentence), and finally conclude whether the use in the two is similar or dissimilar. I use examples from train.tail().

fewShotVerb = """Q: How is 'sanitize' used in the following two sentences?
Sentences: 'Sanitize the language in a book .'; 'Sanitize history .'
A: In the first sentence, 'sanitize' means to edit. In the second it means to edit.
They are similar

Q: How is 'try' used in the following two sentences?
Sentences: 'You are trying my patience .'; 'Try the yak butter .'
A: In the first sentence, 'try' means to test something. In the second it means to taste.
They are dissimilar

Q: How is 'drive' used in the following two sentences?
Sentences: 'I drive to work every day .'; 'We drove to the university every morning .'
A: In the first sentence, 'drive' means to take a vehicle. In the second it means to take a vehicle.
They are similar

Q: How is 'break' used in the following two sentences?
Sentences: 'My daughter 's fancy wedding is going to break me .'; 'He broke the glass plate .'
A: In the first sentence, 'break' means to bankrupt. In the second it means to shatter something.
They are dissimilar

Q: How is 'write' used in the following two sentences?
Sentences: 'How many books did Georges Simenon write .'; 'Please write to me every week .'
A: In the first sentence, 'write' refers to publishing. In the second it means to correspond.
They are dissimilar

Q: How is 'keep' used in the following two sentences?
Sentences: 'Keep my seat , please'; 'Keep open the possibility of a merger .'
A: In the first sentence, 'keep' refers to leaving open. In the second it means leaving open.
They are similar

"""
def queryVerb(row):
  context = """How is '{}' used in the following two sentences?
Sentences: '{}'; '{}'
A:""".format(row[1]["target"], row[1]["context-1"], row[1]["context-2"])
  return queryTwoLine(fewShotVerb+context)

And we can test this to make sure the results look reasonable.

for row in train[train.pos=="V"].head().iterrows():
  print (queryVerb(row))
output
In the first sentence, 'carry' means to transport. In the second it means to transmit.
They are dissimilar
In the first sentence, 'go' means to be conveyed. In the second it means to fit.
They are dissimilar
In the first sentence, 'break' means to destroy. In the second it means to divide.
They are dissimilar
In the first sentence, 'set' means to place. In the second it means to fix.
They are dissimilar
In the first sentence, 'starch' means to stiffen. In the second it means to stiffen.
They are similar
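Since each answer ends with "They are similar" or "They are dissimilar", the verdict can be read off the last word of the output, which is the same trick the dev-set evaluation uses later. A small helper, with my own naming:

```python
def verdict_to_label(output):
    """Map a two-line answer ending in 'similar'/'dissimilar' to T/F;
    returns None when the model drifted off-format."""
    words = output.strip().split()
    last = words[-1] if words else ""
    if last == "similar":
        return "T"
    if last == "dissimilar":
        return "F"
    return None

print(verdict_to_label("In the first sentence, 'starch' means to stiffen. "
                       "In the second it means to stiffen.\nThey are similar"))  # → T
```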

We can do the same thing for the nouns. I use examples from train.tail()

fewShotNoun = """Q: How is 'motion' used in the following two sentences?
Sentences: 'The cinema relies on apparent motion .'; 'He made a motion to adjourn .'
A: In the first sentence, 'motion' means movement. In the second it means proposal.
They are dissimilar

Q: How is 'night' used in the following two sentences?
Sentences: 'It vanished into the night .'; 'The cat disappeared into the night  .'
A: In the first sentence, 'night' means darkness. In the second it means darkness.
They are similar

Q: How is 'state' used in the following two sentences?
Sentences: 'His state of health .'; 'In a weak financial state .'
A: In the first sentence, 'state' means condition. In the second it means condition.
They are similar

Q: How is 'air' used in the following two sentences?
Sentences: 'He threw the ball into the air .'; 'A smell of chemicals in the air .'
A: In the first sentence, 'air' means upwards movement. In the second it means a gas.
They are dissimilar

Q: How is 'shopping' used in the following two sentences?
Sentences: 'Women carrying home shopping did n't give me a second glance .'; 'On Saturdays we usually do the shopping .'
A: In the first sentence, 'shopping' means items. In the second it means buying.
They are dissimilar

Q: How is 'sign' used in the following two sentences?
Sentences: 'Those clouds show little sign of raining soon .'; 'Signs of disease are objective , whereas symptoms are subjective .'
A: In the first sentence, 'sign' means indication. In the second it means indications.
They are similar

"""

def queryNoun(row):
  context = """How is '{}' used in the following two sentences?
Sentences: '{}'; '{}'
A:""".format(row[1]["target"], row[1]["context-1"], row[1]["context-2"])
  return queryTwoLine(fewShotNoun+context)

And we can test it! (it actually does really poorly on the examples I use here)

for row in train[train.pos=="N"].head().iterrows():
  print (queryNoun(row))
  print("(actual: {})".format(row[1]["label"]))
output
In the first sentence, 'cup' means a piece of clothing. In the second it means a part of a flower.
They are dissimilar
(actual: T)
In the first sentence, 'academy' means a school. In the second it means a society.
They are dissimilar
(actual: F)
In the first sentence, 'clearance' means permission. In the second it means permission.
They are similar
(actual: T)
In the first sentence, 'coverage' means thickness. In the second it means extent.
They are dissimilar
(actual: T)
In the first sentence, 'death' means the end of life. In the second it means the killing of a person.
They are dissimilar
(actual: F)

In reality, I tested this and a bunch of variations before running against the dev set, but this post is getting a bit long so I'll skip to the end.

We then load the dev set

dev = pd.read_csv("dev/dev.data.txt", sep='\t', header=None)
dev.columns = ["target", "pos", "position", "context-1", "context-2"]
dev_gold = pd.read_csv("dev/dev.gold.txt", sep='\t', header=None)
dev_gold.columns = ["label"]
dev = pd.concat([dev_gold,dev], axis=1)

devResults = {}
correct = 0
complete = 0

for row in dev.iterrows():

  if row[0] in devResults:
    continue

  q1 = row[1]["context-1"]
  q2 = row[1]["context-2"]
  target = row[1]["target"]
  actual = row[1]["label"]

  if row[1]["pos"] == "N":
    output = queryNoun(row)

  else:
    output = queryVerb(row)

  devResults[row[0]] = {
      "q1": q1, "q2": q2, "pos": row[1]["pos"],
      "target": target, "output": output, "actual": actual,
  }
  complete += 1

  # the last word of the answer carries the verdict: "similar" = T, "dissimilar" = F
  last = output.strip().split()[-1]
  if actual == "T" and last == "similar":
    correct += 1
  if actual == "F" and last == "dissimilar":
    correct += 1

  if row[0] % 50 == 0:
    print("Complete: {} Correct: {} Wrong: {}".format(complete, correct, complete - correct))

This gives me a big dictionary, which I can convert to a DataFrame.

devDf = pd.DataFrame(devResults).T
devDf["pred"] = devDf["output"].apply(lambda x: "T" if (x.split()[-1] == ("similar")) else "F")
tmp = devDf.copy()

tmp["accurate"] = tmp["actual"] == tmp["pred"]
tmp["accurate"].sum()/len(tmp)

So this gives me 63.4% accuracy on the full dev set of ~600 examples.

Breaking down between nouns and verbs, I see that it gets similar performance on both.

tmp = devDf[devDf.pos=="N"].copy()
tmp["accurate"] = tmp["actual"] == tmp["pred"]
tmp["accurate"].sum()/len(tmp)
# 0.625

tmp = devDf[devDf.pos=="V"].copy()
tmp["accurate"] = tmp["actual"] == tmp["pred"]
tmp["accurate"].sum()/len(tmp)
# 0.65

Also, it does a bit better on the gold-T (same sense) examples than on the gold-F ones:

tmp = devDf[devDf.actual=="T"].copy()
tmp["accurate"] = tmp["actual"] == tmp["pred"]
tmp["accurate"].sum()/len(tmp)
# 0.6724137931034483

tmp = devDf[devDf.actual=="F"].copy()
tmp["accurate"] = tmp["actual"] == tmp["pred"]
tmp["accurate"].sum()/len(tmp)
# 0.5869565217391305
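Those two subset accuracies are just per-class recall: the fraction of gold-T (or gold-F) examples predicted correctly. A small sketch computing both from (actual, pred) pairs, with my own function name:

```python
def class_recall(pairs, cls):
    """Recall for one class: fraction of gold-`cls` examples predicted correctly."""
    relevant = [(a, p) for a, p in pairs if a == cls]
    if not relevant:
        return 0.0
    return sum(a == p for a, p in relevant) / len(relevant)

pairs = [("T", "T"), ("T", "T"), ("T", "F"), ("F", "T"), ("F", "F")]
print(class_recall(pairs, "T"), class_recall(pairs, "F"))  # → 0.6666666666666666 0.5
```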

Anyway, I'd guess there are ways to manipulate those prompts to get better performance.

Few Shot on Noun/Verb + Search Similarity (67% test)

Code: https://gist.github.com/brockmanmatt/7265297f21634693868c2aad9d2c5919

Not sure if this is worth writing up in detail; I think improving the previous version makes more sense. It does the same thing as the previous approach, except instead of using the similar/dissimilar answer from the few shots directly, it does a semantic similarity comparison between the produced meanings.

Agent and Agent Clarification (69% Dev)

Code: https://gist.github.com/brockmanmatt/deafb4dba7e4399327e44f2c8fd97b2b

Optimizing a Single Prompt (~60% Dev)

So if we want to be purists about this, we might want a single prompt that does this really well. Unfortunately, performance is really dependent on the makeup of the examples. Trying between 5 and 18 examples shows neither large improvement nor worsening; this underlines the importance of a good prompt, because more examples alone won't always help.
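One way to run that sweep is to keep a pool of formatted example blocks and truncate it to k before each query; a sketch of the prompt assembly (the pool contents and names here are placeholders):

```python
def prompt_with_k(example_blocks, k, question):
    """Few-shot prompt using only the first k formatted example blocks."""
    return "".join(example_blocks[:k]) + question

# 18 dummy example blocks standing in for real formatted WiC examples
blocks = ["Q: example {}\nA: Yes\n\n".format(i) for i in range(18)]
p5 = prompt_with_k(blocks, 5, "Q: the real question\nA:")
p18 = prompt_with_k(blocks, 18, "Q: the real question\nA:")
print(p5.count("Q:"), p18.count("Q:"))  # → 6 19
```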

Code: https://gist.github.com/brockmanmatt/0008cf6ae0fea4cb104111472012b864

[Image: WiC_no_increase.png]
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License