Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly ...
Our basic pre-training approach, including model, data, and training, is similar to the process described in prior work,...
We use the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversibl...
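The pre-normalization noted above places layer normalization at the input of each residual sub-layer rather than at its output. The sketch below is a minimal, illustrative pre-norm block in PyTorch; it is not the GPT-3 implementation (attention-pattern details are omitted), and the module names and sizes are assumptions.

```python
# Minimal sketch of a pre-normalization ("pre-LN") transformer block, as used
# in GPT-2/GPT-3-style models: LayerNorm is applied to the *input* of each
# residual sub-layer. Illustrative only; names and sizes are assumptions.
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # Pre-norm residual branches: x + SubLayer(LayerNorm(x))
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x
```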
Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset constituting nearly a trilli...
As found in prior work, larger models can typically use a larger batch size, but require a smaller learning rate. We mea...
For few-shot learning, we evaluate each example in the evaluation set by randomly drawing K examples from that task's tr...
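To make this K-shot conditioning concrete, the sketch below assembles an evaluation prompt from K demonstrations drawn at random from the task's training set, followed by the context of the example being evaluated. The delimiter and field names are illustrative assumptions, not the exact per-benchmark templates.

```python
# Sketch of K-shot prompt construction: K demonstrations are sampled at random
# from the task's training set and prepended to the evaluation example's
# context. Formatting details below are assumptions for illustration.
import random

def build_few_shot_prompt(train_set, eval_example, k, rng=random):
    """train_set entries and eval_example hold 'context' and 'completion' strings."""
    demos = rng.sample(train_set, k)
    lines = [f"{d['context']} {d['completion']}" for d in demos]
    lines.append(eval_example["context"])  # the model must supply the completion
    return "\n\n".join(lines)

# Toy usage with an English -> French word-translation task:
train = [
    {"context": "cheese =>", "completion": "fromage"},
    {"context": "house =>", "completion": "maison"},
    {"context": "cat =>", "completion": "chat"},
]
prompt = build_few_shot_prompt(train, {"context": "dog =>"}, k=2)
```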
In Figure 3.1 we display training curves for the 8 models described in Section 2. For this graph we also include 6 addit...
In this section we test GPT-3's performance on the traditional task of language modeling, as well as related tasks that ...
In this section we measure GPT-3's ability to answer questions about broad factual knowledge. We evaluate GPT-3 on the 3...
For GPT-2 a filter was used on a multilingual collection of documents to produce an English only dataset due to capacity...
The Winograd Schemas Challenge is a classical task in NLP that involves determining which word a pronoun refers to, when...
Next we consider three datasets which attempt to capture physical or scientific reasoning. The first, PhysicalQA (PIQA),...
Next we evaluate GPT-3 on the task of reading comprehension using a suite of 5 datasets including abstractive, multiple ...
In order to better aggregate results on NLP tasks and compare to popular models such as BERT and RoBERTa in a more syste...
Natural Language Inference (NLI) concerns the ability to understand the relationship between two sentences. In practice,...
One way to probe GPT-3's range of abilities in the few-shot (or zero- and one-shot) setting is to give it tasks which re...
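As one illustration of such on-the-fly tasks, the sketch below generates 3-digit addition problems (as mentioned in the abstract) and scores a completion by exact match, reusing the few-shot prompt construction sketched earlier; the question phrasing and scoring rule are assumptions for illustration.

```python
# Sketch of generating and scoring synthetic 3-digit addition problems under
# the few-shot protocol. The phrasing and exact-match scoring are assumptions.
import random

def make_addition_problem(rng=random):
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return {"context": f"Q: What is {a} plus {b}?\nA:", "completion": f" {a + b}"}

def exact_match(model_output: str, target: str) -> bool:
    # Correct iff the model's completion matches the target answer exactly
    # (after stripping surrounding whitespace).
    return model_output.strip() == target.strip()
```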
Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our benchm...
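One common way to flag such train-test overlap is to search for benchmark examples whose word n-grams also occur in the training corpus. The sketch below is a hedged illustration of that idea; the choice of n, the text normalization, and the corpus index are assumptions rather than the paper's exact procedure.

```python
# Sketch of n-gram-based contamination checking: an evaluation example is
# flagged if any of its word n-grams also appears in the training corpus.
# The value of n and the normalization are assumptions for illustration.
def ngrams(text: str, n: int) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(example_text: str, training_ngrams: set, n: int = 13) -> bool:
    """Flag an evaluation example whose n-grams overlap the training data."""
    return bool(ngrams(example_text, n) & training_ngrams)
```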
GPT-3 and our analysis of it have a number of limitations. First, despite the strong quantitative and qualitative improv...
Language models have a wide range of beneficial applications for society, including code and writing auto-completion, gr...
Several lines of work have focused on increasing parameter count and/or computation in language models as a means to imp...
We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and benchmarks in t...