Benchmarking
Testing how well the API performs on different benchmark tasks
| Tasks Attempted in Original Paper | |||||||
| Task | Paper 0-Shot | Paper 1-Shot | Paper Few-Shot | 0-Shot Current Best GPT | 1-Shot Current Best GPT | Few-Shot Current Best GPT | Chaining Current Best GPT |
| LAMBADA (acc) | 76.2 | 72.5 | 86.4 | - | - | - | - |
| LAMBADA (ppl) | 3.00 | 3.35 | 1.92 | - | - | - | - |
| StoryCloze (acc) | 83.2 | 84.7 | 87.7 | - | - | - | - |
| HellaSwag (acc) | 78.9 | 78.1 | 79.3 | - | - | - | - |
| NaturalQS | 14.6 | 23.0 | 29.9 | - | - | - | - |
| WebQS | 14.4 | 25.3 | 41.5 | - | - | - | - |
| TriviaQA | 64.3 | 68.0 | 71.2 | - | - | - | - |
| BLEU WMT'14 En->Fr | 25.2 | 28.3 | 32.6 | - | - | - | - |
| BLEU WMT'14 Fr->En | 21.2 | 33.7 | 39.2 | - | - | - | - |
| BLEU WMT'16 En->De | 24.6 | 26.2 | 29.7 | - | - | - | - |
| BLEU WMT'16 De->En | 27.2 | 30.4 | 40.6 | - | - | - | - |
| BLEU WMT'16 En->Ro | 14.1 | 20.6 | 21.0 | - | - | - | - |
| BLEU WMT'16 Ro->En | 19.9 | 38.6 | 39.5 | - | - | - | - |
| PIQA | 80.5 | 80.5 | 82.8 | - | - | - | - |
| ARC (easy) | 68.8 | 71.2 | 70.1 | - | - | - | - |
| ARC (challenge) | 51.4 | 53.2 | 51.5 | - | - | - | - |
| OpenBookQA | 57.6 | 58.8 | 65.4 | - | - | - | - |
| CoQA | 81.5 | 84.0 | 85.9 | - | - | - | - |
| DROP | 23.6 | 34.3 | 36.5 | - | - | - | - |
| QuAC | 41.5 | 43.5 | 44.3 | - | - | - | - |
| SQuADv2 | 59.5 | 65.4 | 69.8 | - | - | - | - |
| RACE-h | 45.5 | 45.9 | 46.8 | - | - | - | - |
| RACE-m | 58.4 | 57.4 | 58.1 | - | - | - | - |
| BoolQ Acc | - | - | 76.4 | - | - | - | - |
| CB Acc | - | - | 75.6 | - | - | - | - |
| CB F1 | - | - | 52.0 | - | - | - | - |
| COPA Acc | - | - | 92.0 | - | - | - | - |
| RTE Acc | - | - | 69.0 | - | - | - | - |
| WiC Acc | - | - | 49.4 | - | - | 63.4 | 69 |
| WSC Acc | - | - | 80.1 | - | - | - | - |
| MultiRC Acc | - | - | 30.5 | - | - | - | - |
| ReCoRD Acc | - | - | 90.2 | - | - | - | - |
| ReCoRD F1 | - | - | 91.1 | - | - | - | - |
| ANLI R1 (Dev) | - | - | 36.8 | - | - | - | - |
| ANLI R2 (Dev) | - | - | 34.0 | - | - | - | - |
| ANLI R3 (Dev) | - | - | 40.2 | - | - | 42.5 | - |
| 2D+ | 76.9 | 99.6 | 100.0 | - | - | - | - |
| 2D- | 58.0 | 86.4 | 98.9 | - | - | - | - |
| 3D+ | 34.2 | 65.5 | 80.4 | - | - | - | - |
| 3D- | 48.3 | 78.7 | 94.2 | - | - | - | - |
| 4D+ | 4.0 | 14.0 | 25.5 | - | - | - | - |
| 4D- | 7.5 | 14.0 | 26.8 | - | - | - | - |
| 5D+ | 0.7 | 3.5 | 9.3 | - | - | - | - |
| 5D- | 0.8 | 3.8 | 9.9 | - | - | - | - |
| 2Dx | 19.8 | 27.4 | 29.2 | - | - | - | - |
| 1Dc | 9.8 | 14.3 | 21.3 | - | - | - | - |
| Cycle Letter | 3.66 | 21.7 | 37.9 | - | - | - | - |
| Anagram Cycle F/L | 2.28 | 8.62 | 15.1 | - | - | - | - |
| Anagram Cycle Mid | 8.91 | 25.9 | 39.7 | - | - | - | - |
| Random Insertion in Word | 8.26 | 45.4 | 67.2 | - | - | - | - |
| Reversed Word | 0.09 | 0.48 | 0.44 | - | - | - | - |
| Tasks not Attempted in Original Paper | ||||
| Task | SOTA | 1-Shot Performance | Few-Shot Performance | Multi-Prompt Performance |
| Break High Level (EM) | 0.09 | - | 0.09 | - |
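
For rows in the first table that have both a paper few-shot score and a current best few-shot score, the gap can be computed directly from the pipe-delimited text. The following is a minimal sketch, not part of the original page: the helper names `parse_cell`, `parse_row`, and `few_shot_gain` are hypothetical, and the column order is assumed to match the header row of the first table.

```python
from typing import List, Optional, Tuple


def parse_cell(cell: str) -> Optional[float]:
    """Convert one table cell to a float; '-' or an empty cell means no result."""
    cell = cell.strip()
    return None if cell in ("", "-") else float(cell)


def parse_row(line: str) -> Tuple[str, List[Optional[float]]]:
    """Split a pipe-delimited row into the task name and its numeric cells."""
    cells = line.strip().strip("|").split("|")
    return cells[0].strip(), [parse_cell(c) for c in cells[1:]]


def few_shot_gain(line: str) -> Optional[float]:
    """Current best few-shot minus paper few-shot, in points.

    Assumed column order after the task name: paper 0-shot, 1-shot, few-shot,
    then current best 0-shot, 1-shot, few-shot, chaining.
    """
    _, values = parse_row(line)
    paper, current = values[2], values[5]
    if paper is None or current is None:
        return None
    return round(current - paper, 1)


# Rows copied from the first table.
rows = [
    "| WiC Acc | - | - | 49.4 | - | - | 63.4 | 69 |",
    "| ANLI R3 (Dev) | - | - | 40.2 | - | - | 42.5 | - |",
    "| PIQA | 80.5 | 80.5 | 82.8 | - | - | - | - |",
]

for row in rows:
    task, _ = parse_row(row)
    gain = few_shot_gain(row)
    print(task, "no current result" if gain is None else f"{gain:+.1f} points")
```

Rows without a current best entry return `None` rather than a number, so unfilled cells are not mistaken for a score of zero.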
page revision: 34, last edited: 15 Aug 2020 00:17