Chain-of-Thought Reasoning is a Policy Improvement Operator

Abstract

Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on being trained on large amounts of human-generated data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can successfully teach themselves new skills using chain-of-thought reasoning. Inspired by previous work in both reinforcement learning (Silver et al., 2017) and human cognition (Kahneman, 2011), SECToR first uses chain-of-thought reasoning to slowly think its way through problems. SECToR then fine-tunes the model to generate those same answers, this time without using chain-of-thought reasoning. Language models trained via SECToR autonomously learn to add numbers of up to 29 digits without access to any ground-truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, analogously to how Monte-Carlo Tree Search is used in AlphaZero. We hope that this research can lead to new directions in which language models can learn to teach themselves without the need for human demonstrations.
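
The loop described in the abstract (answer slowly with chain-of-thought, then train the model to give the same answers directly) can be illustrated with a toy simulation. This is only a sketch under strong assumptions, not the SECToR implementation: there is no neural network here, and `ToyModel`, `fast_add`, `chain_of_thought_add`, and `self_train` are hypothetical names introduced for illustration. The "model" is simulated by a digit limit on what it can answer directly, and "fine-tuning" is simulated by raising that limit after a round of self-generated examples.

```python
import random


class ToyModel:
    """Hypothetical stand-in for a language model (assumption, not SECToR code).

    It can answer an addition 'directly' only when both operands have at
    most `max_digits` digits.
    """

    def __init__(self, max_digits):
        self.max_digits = max_digits

    def fast_add(self, a, b):
        # Direct answer with no intermediate reasoning.
        if max(len(str(a)), len(str(b))) > self.max_digits:
            raise ValueError("problem too long to answer directly")
        return a + b


def chain_of_thought_add(model, a, b):
    """Slow path: split a problem that is one digit too long into
    subproblems the direct path can already solve, then recombine."""
    k = model.max_digits
    base = 10 ** k
    a_hi, a_lo = divmod(a, base)
    b_hi, b_lo = divmod(b, base)
    # Step 1: add the low k digits directly and extract the carry.
    carry, low = divmod(model.fast_add(a_lo, b_lo), base)
    # Step 2: add the leading digits, then the carry (both short problems).
    high = model.fast_add(model.fast_add(a_hi, b_hi), carry)
    # Step 3: recombine; in text form this is digit concatenation.
    return high * base + low


def self_train(model, target_digits, samples_per_round=50, seed=0):
    """Alternate slow reasoning and simulated fine-tuning until the model
    answers `target_digits`-digit additions directly."""
    rng = random.Random(seed)
    history = []
    while model.max_digits < target_digits:
        n = model.max_digits + 1
        # Labels come only from the model's own slow reasoning,
        # never from a ground-truth oracle.
        dataset = []
        for _ in range(samples_per_round):
            a = rng.randrange(10 ** (n - 1), 10 ** n)
            b = rng.randrange(10 ** (n - 1), 10 ** n)
            dataset.append((a, b, chain_of_thought_add(model, a, b)))
        history.append(dataset)
        # Simulated fine-tuning step (assumption): the direct path now
        # covers the length that previously required slow reasoning.
        model.max_digits = n
    return history


model = ToyModel(max_digits=6)  # mirrors the 6-digit supervised phase
history = self_train(model, target_digits=29)
print(model.max_digits, len(history))
```

In this toy setting each round extends the direct path by one digit, which mirrors the policy-improvement framing: the slow procedure is stronger than the current direct policy, and distilling it back yields a stronger direct policy for the next round. A real model's self-generated labels can be wrong, which this simulation does not capture.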
