Adversarial Natural Language Inference (ANLI) is a benchmark introduced by a team at Facebook. It evaluates how well a model can determine whether a context entails, contradicts, or is neutral toward a hypothesis, so each example gets one of three labels: entailment, contradiction, or neutral.


The GPT paper reports 40% accuracy on the round 3 (R3) dev set.

Multi-pass approach

The hardest part seems to be finding a prompt that can separate out the neutral statements (those that are neither contradiction nor entailment).
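To make the idea of a per-pass prompt concrete, here is a minimal sketch of a template that asks one binary question at a time. The function name `build_binary_prompt` and the prompt wording are hypothetical illustrations, not the prompts actually used in these runs.

```python
def build_binary_prompt(context: str, hypothesis: str,
                        option_a: str, option_b: str) -> str:
    """Build a prompt that asks the model to choose between two labels.

    The wording is an assumption for illustration only; the real prompts
    behind the scores in these notes are not reproduced here.
    """
    return (
        f"Context: {context}\n"
        f"Hypothesis: {hypothesis}\n"
        f"Question: Given the context, is the hypothesis "
        f"{option_a} or {option_b}?\n"
        "Answer:"
    )


# Example: a "neutral vs. not neutral" pass over a made-up example.
prompt = build_binary_prompt(
    context="The cafe opens at 9am on weekdays.",
    hypothesis="The cafe serves espresso.",
    option_a="neutral",
    option_b="not neutral",
)
print(prompt)
```

Keeping each pass to a two-way choice means the same template can be reused for entail vs. contradict and for neutral vs. not by swapping the two option strings.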

Step 1: Entail v. Contradict (42.25% R3 Dev)

JSON of step 1 results

Step 2: Entail vs. Neutral NCE (43.5% R3 Dev)

Step 2: Entail/Contradict vs. Neutral (40%-43% Dev)

Replacing both the contradict and the entailment labels from step 1 with neutral drops accuracy to 40%. Replacing only the examples previously labeled contradict raises it to 43%.
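The relabeling experiment above can be sketched roughly as follows. This is a minimal sketch under assumptions: the label codes `"e"`, `"c"`, `"n"`, the helper names `combine_passes` and `accuracy`, and the toy data are all hypothetical, and the step 2 output is assumed to be a boolean "looks neutral" flag per example.

```python
from typing import FrozenSet, List


def combine_passes(step1: List[str], step2_neutral: List[bool],
                   replace: FrozenSet[str] = frozenset({"c"})) -> List[str]:
    """Merge two passes into final labels.

    A step-1 label is overwritten with "n" only when step 2 flagged the
    example as neutral AND the step-1 label is in `replace`.
    The default replaces only previously labeled contradictions.
    """
    return [
        "n" if flagged and label in replace else label
        for label, flagged in zip(step1, step2_neutral)
    ]


def accuracy(pred: List[str], gold: List[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must have the same length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


# Toy data (made up, not real ANLI results).
gold = ["e", "c", "n", "n"]
step1 = ["e", "c", "c", "e"]          # step 1 never predicts neutral
step2 = [True, False, True, True]     # step 2: "looks neutral?"

only_contradict = combine_passes(step1, step2)
both = combine_passes(step1, step2, replace=frozenset({"e", "c"}))
print(only_contradict, accuracy(only_contradict, gold))
print(both, accuracy(both, gold))
```

Making the set of replaceable labels a parameter lets both variants (replace contradict only, or replace contradict and entailment) be scored with the same step 1 and step 2 outputs.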

Neutral v. Not:

Haven't run this yet to see whether combining it with the previous step improves the overall score.

