Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
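
To make concrete what "specified purely via text interaction" can look like, the following is a minimal sketch of assembling a few-shot prompt as plain text. The function name `build_few_shot_prompt`, the `=>` separator, and the translation demonstrations are illustrative assumptions, not the exact format used in the experiments; the point is only that the task description and demonstrations are placed in the model's input and no parameters are updated.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Concatenate a task description, K demonstrations, and a final query.

    The layout ("input => output", one pair per line) is a hypothetical
    choice for illustration; any consistent textual format would do.
    """
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    # The query is left unanswered so that the language model completes it.
    lines.append(f"{query} =>")
    return "\n".join(lines)


# Two demonstrations (K = 2) followed by the item to be completed.
prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(prompt)
```

The resulting string would be passed to the language model as its context, and the model's continuation is read off as the answer. Varying the number of demonstrations, down to zero, changes only the text of the prompt.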