GPT-3
GPT-3 is an enormous model built on the transformer-decoder architecture, published in 2020 by OpenAI in the paper “Language Models are Few-Shot Learners”, whose title is very indicative of what the paper set out to show. The paper doesn’t propose a new architecture; it uses the same architecture as GPT-2, just scaled up considerably and trained on more data.
The whole purpose of the paper is to show that GPT-3 can handle a wide variety of tasks using zero-shot, one-shot, or few-shot learning schemes, even reaching competitiveness with prior state-of-the-art fine-tuned models. Before getting into more details about the model, let’s first discuss what I mean by these learning schemes and how they differ from fine-tuning:
- Few-shot (FS):
It’s the setting where the model is given K (usually 10 to 100) examples of the task at inference time as conditioning, but no weight updates are allowed. As we can see in the following figure, GPT-3 was given three different examples along with the task description (a minimal prompt-construction sketch follows this list):
- One-shot (1S):
It’s the same as few-shot except that only one demonstration is allowed, in addition to the task description. The reason to distinguish one-shot from few-shot is that it most closely matches the way in which some tasks are communicated to humans:
- Zero-shot (0S):
It’s the same as one-shot except that no demonstrations are allowed, just the task description. This method provides maximum potential for robustness but is also the most challenging setting even for humans.
- Fine-Tuning (FT):
It has been the most common approach in recent years, and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Its main drawbacks are the need for a new large labeled dataset for every task and the risk of poor generalization out-of-distribution:
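To make the distinction concrete, here is a minimal Python sketch of how such prompts are assembled for in-context learning, mirroring the English→French example from the paper’s figure. The helper name `build_prompt` is hypothetical, not part of any released code; the only thing that changes between the three schemes is the number of demonstrations K.

```python
# Minimal sketch of zero-, one-, and few-shot prompt construction for
# in-context learning. No weights are updated; the "learning" happens
# purely through the text the model is conditioned on at inference time.
# The helper name and formatting are illustrative assumptions.

def build_prompt(task_description, demonstrations, query):
    """Concatenate the task description, K demonstrations, and the query."""
    lines = [task_description, ""]
    for source, target in demonstrations:   # K = 0 (zero-shot), 1 (one-shot), or 10-100 (few-shot)
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")              # the model is expected to continue this line
    return "\n".join(lines)

# Few-shot (K = 3) English -> French translation prompt:
demos = [("sea otter", "loutre de mer"),
         ("peppermint", "menthe poivrée"),
         ("plush giraffe", "girafe en peluche")]
prompt = build_prompt("Translate English to French:", demos, "cheese")
print(prompt)
# Zero-shot is the same call with demos = [].
```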
Model
As said earlier, they used the same model and architecture as GPT-2. To study the dependence of performance on model size, they trained eight models of different sizes, as shown in the following table:
Where:
- $n_{\text{params}}$: the total number of trainable parameters.
- $n_{\text{layers}}$: the total number of layers.
- $d_{\text{model}}$: the number of units in each bottleneck layer (the feed-forward layer is always four times the size of the bottleneck layer, $d_{\text{feedforward}} = 4 \times d_{\text{model}}$).
- $n_{\text{heads}}$: the number of attention heads in each layer.
- $d_{\text{head}}$: the dimension of each attention head.
As you can see, GPT-3 is massive: its context window is $n_{\text{ctx}} = 2048$ tokens wide, and it has about 175 billion learnable parameters spread over 96 transformer-decoder layers.
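As a sanity check on these numbers, here is a small sketch that estimates the parameter count from the largest model’s hyperparameters, using the rough approximation of $12 \cdot n_{\text{layers}} \cdot d_{\text{model}}^2$ weights for the transformer blocks plus the embedding matrices. The vocabulary size (~50,257 BPE tokens) is an assumption carried over from GPT-2, not a number quoted in this section.

```python
# Rough parameter-count estimate for the largest GPT-3 configuration.
# Assumptions: attention contributes ~4 * d_model^2 weights per layer
# (Q, K, V, output projections) and the feed-forward block contributes
# 2 * d_model * 4*d_model = 8 * d_model^2, i.e. ~12 * d_model^2 per layer;
# biases and layer norms are ignored. The GPT-2 BPE vocabulary is assumed.

n_layers = 96
d_model  = 12288
n_ctx    = 2048
vocab    = 50257                                   # assumed GPT-2 BPE vocabulary

per_layer   = 12 * d_model ** 2                    # attention + feed-forward weights
transformer = n_layers * per_layer                 # ~174.0B
embeddings  = vocab * d_model + n_ctx * d_model    # token + learned position embeddings

total = transformer + embeddings
print(f"{total / 1e9:.1f}B parameters")            # ~174.6B, i.e. roughly the quoted 175B
```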
The data used to train these models is summarized in the following table:
And the following is a comparison of the compute used to train BERT, RoBERTa, T5 and GPT-3. As we can see from the graph, training GPT-3 required several thousand petaflop/s-days of compute, far more than the other models.
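For intuition about where a number of that magnitude comes from, here is a back-of-the-envelope estimate assuming the common $C \approx 6\,N\,D$ FLOPs rule of thumb (roughly 6 FLOPs per parameter per training token) and the ~300 billion training tokens reported in the paper; the approximation itself is a standard heuristic, not a formula spelled out in this section.

```python
# Back-of-the-envelope training-compute estimate for GPT-3, assuming the
# common approximation C ≈ 6 * N * D FLOPs (N = parameters, D = training tokens).
N = 175e9            # parameters
D = 300e9            # training tokens reported in the paper
flops = 6 * N * D    # ≈ 3.15e23 FLOPs

petaflop_s_day = 1e15 * 86400   # FLOPs delivered by 1 PFLOP/s sustained for one day
print(f"{flops / petaflop_s_day:,.0f} petaflop/s-days")   # ≈ 3,646 petaflop/s-days
```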
Results
The following is a comparison among the different learning schemes used with GPT-3 and the state-of-the-art (fine-tuned) models on various tasks (a short sketch of the perplexity metric used below follows this list):
- Language Modeling:
  - Dataset: Penn Tree Bank
  - Evaluation Metric: perplexity
- Long-Range Language Modeling:
  - Dataset: LAMBADA
  - Evaluation Metric: perplexity / Accuracy
- Story Completion:
  - Dataset: StoryCloze & HellaSwag
  - Evaluation Metric: Accuracy
- Question Answering:
  - Dataset: NaturalQS, WebQS & TriviaQA
  - Evaluation Metric: Accuracy
- Machine Translation:
  - Dataset: WMT’14 (Fr↔En), WMT’16 (De↔En) & WMT’16 (Ro↔En)
  - Evaluation Metric: BLEU
- Winograd-Style Tasks (determining which word a pronoun refers to):
  - Dataset: Winograd & Winogrande (XL)
  - Evaluation Metric: Accuracy
- Common Sense Reasoning:
  - Dataset: PIQA, ARC & OpenBookQA
  - Evaluation Metric: Accuracy
- Reading Comprehension:
  - Dataset: CoQA, DROP, QuAC, SQuADv2, RACE-h & RACE-m
  - Evaluation Metric: Accuracy for RACE-h & RACE-m, and F1 for the rest
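Since perplexity is the headline metric in the first two entries above, here is a minimal sketch of how it can be computed from the probabilities a language model assigns to each token of a test text; the probability values are made up purely for illustration.

```python
import math

# Perplexity: the exponentiated average negative log-likelihood the language
# model assigns to the actual next token at each position. Lower is better.
def perplexity(token_probs):
    """token_probs: probability the model assigned to each observed next token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Toy example: probabilities a model might assign to a 5-token sentence
# (values invented for illustration).
probs = [0.25, 0.60, 0.05, 0.40, 0.10]
print(f"perplexity = {perplexity(probs):.2f}")   # ≈ 5.07
```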