Large Language Models are Zero-Shot Reasoners

Abstract

Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and are generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved state-of-the-art performance in arithmetic and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performance on diverse benchmark reasoning tasks including arithmetic (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and on GSM8K from 10.4% to 40.7% with an off-the-shelf 175B parameter model. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting that high-level, multi-task broad cognitive capabilities may be extracted through simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.
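
To make the prompting idea concrete, the following is a minimal sketch of how a plain zero-shot prompt and a Zero-shot-CoT prompt might be assembled as strings. The function names, the `Q:`/`A:` layout, and the sample question are illustrative assumptions, not details taken from the abstract; only the trigger phrase itself comes from the text. No model is called here, and the resulting string would be passed to whatever LLM interface is in use.

```python
# Illustrative sketch only: the helper names and the "Q:/A:" layout are
# assumptions made for this example, not the authors' code.

TRIGGER = "Let's think step by step."  # trigger phrase quoted in the abstract


def zero_shot_prompt(question: str) -> str:
    """Build a plain zero-shot prompt: the question with an empty answer slot."""
    return f"Q: {question}\nA:"


def zero_shot_cot_prompt(question: str, trigger: str = TRIGGER) -> str:
    """Build a Zero-shot-CoT prompt: the same template, with the trigger
    phrase placed at the start of the answer so the model continues with
    step-by-step reasoning."""
    return f"Q: {question}\nA: {trigger}"


if __name__ == "__main__":
    # Hypothetical question, used only to show the two prompt shapes.
    question = "A shop has 3 boxes with 4 pens each. How many pens are there?"
    print(zero_shot_prompt(question))
    print("---")
    print(zero_shot_cot_prompt(question))
```

The two prompts differ only in the trigger phrase, which mirrors the claim that a single task-agnostic template is reused across all benchmarks with no few-shot exemplars.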
