Benchmarking
Tracking how well the API performs on benchmark tasks, compared with the results reported in the original paper. A dash (-) marks a cell with no recorded result.
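The 0-shot, 1-shot, and few-shot columns below differ only in how many worked demonstrations are placed in the prompt before the test question. The following is a minimal sketch of how such a score could be computed, not the procedure used for any number on this page. The `Q:`/`A:` prompt layout, the `build_prompt` and `accuracy` helpers, and the `complete` callable (a stand-in for whatever function sends a prompt to the API and returns the completion text) are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question text, expected answer)


def build_prompt(demos: List[Example], query: str) -> str:
    """Format k demonstrations followed by the unanswered query.

    k = 0 gives a zero-shot prompt, k = 1 a one-shot prompt, and so on.
    The "Q:/A:" layout is an assumed format, not one taken from the paper.
    """
    blocks = [f"Q: {q}\nA: {a}" for q, a in demos]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


def accuracy(
    complete: Callable[[str], str],
    demos: List[Example],
    test_set: List[Example],
) -> float:
    """Percentage of test items whose completion matches the expected answer.

    `complete` is a hypothetical stand-in for the real API call.
    Matching is case-insensitive and ignores surrounding whitespace.
    """
    if not test_set:
        return 0.0
    correct = 0
    for query, expected in test_set:
        prediction = complete(build_prompt(demos, query)).strip().lower()
        if prediction == expected.strip().lower():
            correct += 1
    return 100.0 * correct / len(test_set)


def fake_complete(prompt: str) -> str:
    """Toy model for local testing: adds the two numbers in the last question."""
    last_question = prompt.rsplit("Q: ", 1)[1].split("\n", 1)[0]
    left, right = last_question.split(" + ")
    return f" {int(left) + int(right)}"


if __name__ == "__main__":
    demos = [("12 + 30", "42")]                        # one-shot
    tests = [("48 + 76", "124"), ("10 + 5", "15")]
    print(accuracy(fake_complete, demos, tests))
```

Passing an empty `demos` list gives the zero-shot setting. Swapping `fake_complete` for a real API wrapper is the only change needed to score an actual model, though real benchmarks usually need task-specific prompt formats and answer normalization.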
Tasks Attempted in Original Paper
Task | Paper 0-Shot | Paper 1-Shot | Paper Few-Shot | 0-Shot Current Best GPT | 1-Shot Current Best GPT | Few-Shot Current Best GPT | Chaining Current Best GPT |
LAMBADA (acc) | 76.2 | 72.5 | 86.4 | - | - | - | - |
LAMBADA (ppl) | 3.00 | 3.35 | 1.92 | - | - | - | - |
StoryCloze (acc) | 83.2 | 84.7 | 87.7 | - | - | - | - |
HellaSwag (acc) | 78.9 | 78.1 | 79.3 | - | - | - | - |
NaturalQS | 14.6 | 23.0 | 29.9 | - | - | - | - |
WebQS | 14.4 | 25.3 | 41.5 | - | - | - | - |
TriviaQA | 64.3 | 68.0 | 71.2 | - | - | - | - |
BLEU WMT'14 En->Fr | 25.2 | 28.3 | 32.6 | - | - | - | - |
BLEU WMT'14 Fr->En | 21.2 | 33.7 | 39.2 | - | - | - | - |
BLEU WMT'16 En->De | 24.6 | 26.2 | 29.7 | - | - | - | - |
BLEU WMT'16 De->En | 27.2 | 30.4 | 40.6 | - | - | - | - |
BLEU WMT'16 En->Ro | 14.1 | 20.6 | 21.0 | - | - | - | - |
BLEU WMT'16 Ro->En | 19.9 | 38.6 | 39.5 | - | - | - | - |
PIQA | 80.5 | 80.5 | 82.8 | - | - | - | - |
ARC (easy) | 68.8 | 71.2 | 70.1 | - | - | - | - |
ARC (challenge) | 51.4 | 53.2 | 51.5 | - | - | - | - |
OpenBookQA | 57.6 | 58.8 | 65.4 | - | - | - | - |
CoQA | 81.5 | 84.0 | 85.0 | - | - | - | - |
DROP | 23.6 | 34.3 | 36.5 | - | - | - | - |
QuAC | 41.5 | 43.5 | 44.3 | - | - | - | - |
SQuADv2 | 59.5 | 65.4 | 69.8 | - | - | - | - |
RACE-h | 45.5 | 45.9 | 46.8 | - | - | - | - |
RACE-m | 58.4 | 57.4 | 58.1 | - | - | - | - |
BoolQ Acc | - | - | 76.4 | - | - | - | - |
CB Acc | - | - | 75.6 | - | - | - | - |
CB F1 | - | - | 52.0 | - | - | - | - |
COPA Acc | - | - | 92.0 | - | - | - | - |
RTE Acc | - | - | 69.0 | - | - | - | - |
WiC Acc | - | - | 49.4 | - | - | 63.4 | 69 |
WSC Acc | - | - | 80.1 | - | - | - | - |
MultiRC Acc | - | - | 30.5 | - | - | - | - |
ReCoRD Acc | - | - | 90.2 | - | - | - | - |
ReCoRD F1 | - | - | 91.1 | - | - | - | - |
ANLI R1 (Dev) | - | - | 36.8 | - | - | - | - |
ANLI R2 (Dev) | - | - | 34.0 | - | - | - | - |
ANLI R3 (Dev) | - | - | 40.2 | - | - | 42.5 | - |
2D+ | 76.9 | 99.6 | 100.0 | - | - | - | - |
2D- | 58.0 | 86.4 | 98.9 | - | - | - | - |
3D+ | 34.2 | 65.5 | 80.4 | - | - | - | - |
3D- | 48.3 | 78.7 | 94.2 | - | - | - | - |
4D+ | 4.0 | 14.0 | 25.5 | - | - | - | - |
4D- | 7.5 | 14.0 | 26.8 | - | - | - | - |
5D+ | 0.7 | 3.5 | 9.3 | - | - | - | - |
5D- | 0.8 | 3.8 | 9.9 | - | - | - | - |
2Dx | 19.8 | 27.4 | 29.2 | - | - | - | - |
1Dc | 9.8 | 14.3 | 21.3 | - | - | - | - |
Cycle Letter | 3.66 | 21.7 | 37.9 | - | - | - | - |
Anagram Cycle F/L | 2.28 | 8.62 | 15.1 | - | - | - | - |
Anagram Cycle Mid | 8.91 | 25.9 | 39.7 | - | - | - | - |
Random Insertion in Word | 8.26 | 45.4 | 67.2 | - | - | - | - |
Reversed Word | 0.09 | 0.48 | 0.44 | - | - | - | - |
Tasks Not Attempted in Original Paper
Task | SOTA | 1-Shot Performance | Few Shot Performance | Multi-Prompt Performance |
Break High Level (EM) | 0.09 | - | 0.09 | - |
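For anyone who wants to work with these tables programmatically, the rows can be read as pipe-delimited records in which a dash means "no result". The sketch below is one possible way to do that; `parse_cell` and `parse_row` are hypothetical helper names, and the code assumes every non-dash cell after the task name is a plain number.

```python
from typing import Dict, List, Optional, Union


def split_cells(line: str) -> List[str]:
    """Split one pipe-delimited table line into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().rstrip("|").split("|")]


def parse_cell(cell: str) -> Optional[float]:
    """Return the numeric value of a cell, or None for a dash or empty cell."""
    cell = cell.strip()
    if cell in ("", "-"):
        return None
    return float(cell)


def parse_row(header: List[str], line: str) -> Dict[str, Union[str, Optional[float]]]:
    """Map a table line onto its column names; the first column is the task name."""
    cells = split_cells(line)
    row: Dict[str, Union[str, Optional[float]]] = {header[0]: cells[0]}
    for name, cell in zip(header[1:], cells[1:]):
        row[name] = parse_cell(cell)
    return row


if __name__ == "__main__":
    header = split_cells(
        "Task | SOTA | 1-Shot Performance | Few Shot Performance | Multi-Prompt Performance |"
    )
    print(parse_row(header, "Break High Level (EM) | 0.09 | - | 0.09 | - |"))
```

Keeping missing results as `None` rather than zero avoids treating an untested setting as a score of 0 when rows are later compared or averaged.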
page revision: 34, last edited: 15 Aug 2020 00:17