Language Models Are Few-Shot Learners (2020)
April 12, 2025

In Language Models are Few-Shot Learners, it is shown that scaling up language models greatly improves few-shot performance across a wide range of tasks. The proposed model, GPT-3, is an autoregressive language model with 175 billion parameters. Its architecture mirrors that of GPT-2, except that GPT-3 uses alternating dense and locally banded sparse attention patterns, similar to the Sparse Transformer.
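As a rough illustration of what alternating dense and locally banded causal attention could look like, here is a minimal NumPy sketch. The window size and the even/odd layer alternation are assumptions made for illustration, not GPT-3's actual configuration.

```python
import numpy as np

def attention_mask(seq_len: int, layer_idx: int, window: int = 4) -> np.ndarray:
    """Causal attention mask for one layer: dense on even-indexed layers,
    locally banded (each token attends only to the previous `window` positions)
    on odd-indexed layers.  mask[i, j] is True if position i may attend to j.
    Illustrative only; GPT-3's exact sparsity pattern is not reproduced here."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i                      # never attend to future positions
    if layer_idx % 2 == 0:
        return causal                    # dense causal attention
    banded = (i - j) < window            # only the most recent `window` positions
    return causal & banded               # locally banded causal attention

# Toy sequence of 6 tokens: compare a dense layer with a banded layer
print(attention_mask(6, layer_idx=0).astype(int))
print(attention_mask(6, layer_idx=1).astype(int))
```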
The evaluation tasks include predicting tokens of a document, question answering, translation, determining the antecedent of a pronoun, common sense reasoning, reading comprehension, natural language inference, and arithmetic. GPT-3 improves the state of the art on LAMBADA, a task that asks the model to predict the last word of a sentence, which requires reading a paragraph of context.
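In the few-shot setting, the model is conditioned on a handful of demonstrations placed in its context window, with no gradient updates. Below is a minimal sketch of how a fill-in-the-blank style few-shot prompt for LAMBADA-like last-word prediction could be assembled; the helper name build_few_shot_prompt, the exact formatting, and the demonstration passages are hypothetical, not taken from the paper or the LAMBADA dataset.

```python
def build_few_shot_prompt(demonstrations, test_passage):
    """Assemble a few-shot prompt: each demonstration is a (context, last_word)
    pair written in a cloze-style format, followed by the test passage whose
    last word the language model must complete.  Formatting is illustrative."""
    parts = [f"{context} ____. -> {last_word}" for context, last_word in demonstrations]
    parts.append(f"{test_passage} ____. ->")
    return "\n\n".join(parts)

# Hypothetical demonstrations and test passage (not drawn from LAMBADA itself)
demos = [
    ("She poured the coffee and handed him the", "cup"),
    ("The dog ran across the yard and fetched the", "ball"),
]
print(build_few_shot_prompt(demos, "He struck the match and lit the"))
```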