Chain-of-Thought Reasoning is a Policy Improvement Operator

  • 2023-09-15 18:44:17
  • Hugh Zhang, David C. Parkes


Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on being trained on large amounts of human-generated data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can successfully teach themselves new skills using chain-of-thought reasoning. Inspired by previous work in both reinforcement learning (Silver et al., 2017) and human cognition (Kahneman, 2011), SECToR first uses chain-of-thought reasoning to slowly think its way through problems. SECToR then fine-tunes the model to generate those same answers, this time without using chain-of-thought reasoning. Language models trained via SECToR autonomously learn to add up to 29-digit numbers without access to any ground-truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, analogous to how Monte-Carlo Tree Search is used in AlphaZero. We hope that this research can lead to new directions in which language models can learn to teach themselves without the need for human demonstrations.
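The loop described above (answer slowly with chain-of-thought, then fine-tune the model to give the same answers directly) can be illustrated with a small toy sketch. This is not the SECToR implementation: the `ToyModel` class, its lookup-table "weights", the digit-splitting reasoning step, and the `self_educate` function are all hypothetical stand-ins chosen for illustration. The only ground-truth labels used are single-digit sums; everything else is generated by the model's own slow reasoning.

```python
from itertools import product


class ToyModel:
    """Hypothetical stand-in for a language model; a lookup table plays the role of weights."""

    def __init__(self):
        self.memory = {}

    def fast_answer(self, a, b):
        """Answer directly, with no intermediate reasoning. Returns None if unknown."""
        return self.memory.get((a, b))

    def slow_answer(self, a, b):
        """Chain-of-thought stand-in: reduce the problem to two easier ones
        that the fast mode can already solve, then recombine."""
        a_hi, a_lo = divmod(a, 10)
        b_hi, b_lo = divmod(b, 10)
        low = self.fast_answer(a_lo, b_lo)   # single-digit subproblem
        high = self.fast_answer(a_hi, b_hi)  # subproblem with one fewer digit
        if low is None or high is None:
            return None  # reasoning fails, so no training signal is produced
        carry, digit = divmod(low, 10)
        # Assumption: recombination uses ordinary arithmetic, for brevity.
        return (high + carry) * 10 + digit

    def fine_tune(self, examples):
        """Train the fast mode to reproduce the given answers directly."""
        self.memory.update(examples)


def self_educate(max_digits=2):
    model = ToyModel()
    # Initial supervised phase: the only ground-truth labels ever used.
    model.fine_tune({(a, b): a + b for a, b in product(range(10), repeat=2)})
    for digits in range(2, max_digits + 1):
        # Policy improvement: generate answers with slow reasoning ...
        generated = {}
        for a, b in product(range(10 ** digits), repeat=2):
            answer = model.slow_answer(a, b)
            if answer is not None:
                generated[(a, b)] = answer
        # ... then distill them into the fast mode.
        model.fine_tune(generated)
    return model
```

Calling `self_educate(2)` returns a model whose fast mode answers two-digit sums it was never given labels for. Each round, the slow mode acts as the improved policy and fine-tuning pulls the fast mode up to it, which is the role the abstract attributes to chain-of-thought reasoning. A lookup table cannot generalize the way a fine-tuned language model might, so the toy enumerates every pair and is only practical for small `max_digits`.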


