Benchmarking

Testing how well the API performs on different benchmark tasks

Tasks Attempted in Original Paper
Task Paper 0-Shot Paper 1-Shot Paper Few-Shot 0-Shot Current Best GPT 1-Shot Current Best GPT Few-Shot Current Best GPT Chaining Current Best GPT
LAMBADA (acc) 76.2 72.5 86.4 - - - -
LAMBADA (ppl) 3.00 3.35 1.92 - - - -
StoryCloze (acc) 83.2 84.7 87.7 - - - -
HellaSwag (acc) 78.9 78.1 79.3 - - - -
NaturalQS 14.6 23.0 29.9 - - - -
WebQS 14.4 25.3 41.5 - - - -
TriviaQA 64.3 68.0 71.2 - - - -
BLEU WMT'14 En->Fr 25.2 28.3 32.6 - - - -
BLEU WMT'14 Fr->En 21.2 33.7 39.2 - - - -
BLEU WMT'16 En->De 24.6 26.2 29.7 - - - -
BLEU WMT'16 De->En 27.2 30.4 40.6 - - - -
BLEU WMT'16 En->Ro 14.1 20.6 21.0 - - - -
BLEU WMT'16 Ro->En 19.9 38.6 39.5 - - - -
PIQA 80.5 80.5 82.8 - - - -
ARC (easy) 68.8 71.2 70.1 - - - -
ARC (challenge) 51.4 53.2 51.5 - - - -
OpenBookQA 57.6 58.8 65.4 - - - -
CoQA 81.5 84.0 85.0 - - - -
DROP 23.6 34.3 36.5 - - - -
QuAC 41.5 43.3 44.3 - - - -
SQuADv2 59.5 65.4 69.8 - - - -
RACE-h 45.5 45.9 46.8 - - - -
RACE-m 58.4 57.4 58.1 - - - -
BoolQ Acc - - 76.4 - - - -
CB Acc - - 75.6 - - - -
CB F1 - - 52.0 - - - -
COPA Acc - - 92.0 - - - -
RTE Acc - - 69.0 - - - -
WiC Acc - - 49.4 - - 63.4 69
WSC Acc - - 80.1 - - - -
MultiRC Acc - - 30.5 - - - -
ReCoRD Acc - - 90.2 - - - -
ReCoRD F1 - - 91.1 - - - -
ANLI R1 (Dev) - - 36.8 - - - -
ANLI R2 (Dev) - - 34.0 - - - -
ANLI R3 (Dev) - - 40.2 - - 42.5 -
2D+ 76.9 99.6 100.0 - - - -
2D- 58.0 86.4 98.9 - - - -
3D+ 34.2 65.5 80.4 - - - -
3D- 48.3 78.7 94.2 - - - -
4D+ 4.0 14.0 25.5 - - - -
4D- 7.5 14.0 26.8 - - - -
5D+ 0.7 3.5 9.3 - - - -
5D- 0.8 3.8 9.9 - - - -
2Dx 19.8 27.4 29.2 - - - -
1Dc 9.8 14.3 21.3 - - - -
Cycle Letter 3.66 21.7 37.9 - - - -
Anagram Cycle F/L 2.28 8.62 15.1 - - - -
Anagram Cycle Mid 8.91 25.9 39.7 - - - -
Random Insertion in Word 8.26 45.4 67.2 - - - -
Reversed Word 0.09 0.48 0.44 - - - -


Tasks not Attempted in Original Paper
Task SOTA 1-Shot Performance Few-Shot Performance Multi-Prompt Performance
Break High Level (EM) 0.09 - 0.09 -
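
The zero-, one-, and few-shot columns in the tables above differ only in how many solved examples are placed in the prompt before the test question. As a rough illustration, the sketch below builds a k-shot prompt and scores exact match (EM) over a small test set. The "Q:/A:" prompt layout, the helper names, and the `fake_complete` stand-in for a real API call are all assumptions made for this example; they are not the format used to produce the numbers on this page.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, reference answer)


def build_prompt(shots: List[Example], question: str) -> str:
    """Format k solved examples followed by the unsolved question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in shots]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are not misses."""
    return " ".join(text.lower().split())


def exact_match(
    complete: Callable[[str], str],
    shots: List[Example],
    test_set: List[Example],
) -> float:
    """Fraction of test items whose completion equals the reference."""
    if not test_set:
        return 0.0
    hits = 0
    for question, reference in test_set:
        prediction = complete(build_prompt(shots, question))
        hits += normalize(prediction) == normalize(reference)
    return hits / len(test_set)


# Hypothetical stand-in for a real completion call: always answers "4".
def fake_complete(prompt: str) -> str:
    return " 4"


shots = [("What is 1 + 1?", "2")]
test_set = [("What is 2 + 2?", "4"), ("What is 3 + 3?", "6")]

print(exact_match(fake_complete, [], test_set))     # 0-shot: no solved examples
print(exact_match(fake_complete, shots, test_set))  # 1-shot: one solved example
```

Passing a longer `shots` list gives the few-shot setting. To test a real model, replace `fake_complete` with a function that sends the prompt to the API and returns the completion text.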