Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

  • 2024-03-14 18:58:16
  • Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, Noah D. Goodman

Abstract

When writing and talking, people sometimes pause to think. Although reasoning-focused works have often framed reasoning as a method of answering questions or completing agentic tasks, reasoning is implicit in almost all written text. For example, this applies to the steps not stated between the lines of a proof or to the theory of mind underlying a conversation. In the Self-Taught Reasoner (STaR, Zelikman et al. 2022), useful thinking is learned by inferring rationales from few-shot examples in question-answering and learning from those that lead to a correct answer. This is a highly constrained setting -- ideally, a language model could instead learn to infer unstated rationales in arbitrary text. We present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions. We address key challenges, including 1) the computational cost of generating continuations, 2) the fact that the LM does not initially know how to generate or use internal thoughts, and 3) the need to predict beyond individual next tokens. To resolve these, we propose a tokenwise parallel sampling algorithm, using learnable tokens indicating a thought's start and end, and an extended teacher-forcing technique. Encouragingly, generated rationales disproportionately help model difficult-to-predict tokens and improve the LM's ability to directly answer difficult questions. In particular, after continued pretraining of an LM on a corpus of internet text with Quiet-STaR, we find zero-shot improvements on GSM8K (5.9%$\rightarrow$10.9%) and CommonsenseQA (36.3%$\rightarrow$47.2%) and observe a perplexity improvement of difficult tokens in natural text. Crucially, these improvements require no fine-tuning on these tasks. Quiet-STaR marks a step towards LMs that can learn to reason in a more general and scalable way.
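
The idea that a rationale is useful when it improves prediction of the text that follows can be illustrated with a small toy sketch. This is not the paper's training procedure, which the abstract does not specify. It assumes that per-token log-probabilities of a short window of future tokens are already available, both with and without each candidate thought, and the helper names `future_log_likelihood` and `helpful_thoughts` are hypothetical.

```python
def future_log_likelihood(token_logps):
    """Sum log-probabilities over a window of future tokens.

    Scoring several future tokens rather than one is an assumption made
    here to mirror "predict beyond individual next tokens".
    """
    return sum(token_logps)


def helpful_thoughts(candidates, baseline_logps):
    """Keep the candidate thoughts that make the future text more likely.

    candidates: list of (thought_text, per-token log-probs of the future
        tokens when the model conditions on that thought).
    baseline_logps: per-token log-probs of the same future tokens with
        no thought inserted.
    Returns a list of (thought_text, gain) for thoughts with positive gain.
    """
    base = future_log_likelihood(baseline_logps)
    kept = []
    for thought, logps in candidates:
        gain = future_log_likelihood(logps) - base
        if gain > 0:  # the thought improved prediction of the future text
            kept.append((thought, gain))
    return kept


# Hypothetical numbers, for illustration only.
example = helpful_thoughts(
    [("useful thought", [-1.0, -0.5]), ("distracting thought", [-2.0, -2.0])],
    baseline_logps=[-1.0, -1.0],
)
```

In this toy setting only the first thought is kept, since it raises the log-likelihood of the future window relative to the no-thought baseline. A real system would obtain these log-probabilities from the LM itself and use them as a learning signal rather than a hard filter.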

 
