Recursive Introspection: Teaching Language Model Agents How to Self-Improve

  • 2024-07-25 18:35:59
  • Yuxiao Qu, Tianjun Zhang, Naman Garg, Aviral Kumar

Abstract

A central piece in enabling intelligent agentic behavior in foundation models is to make them capable of introspecting upon their behavior, reasoning, and correcting their mistakes as more computation or interaction is available. Even the strongest proprietary large language models (LLMs) do not quite exhibit the ability of continually improving their responses sequentially, even in scenarios where they are explicitly told that they are making a mistake. In this paper, we develop RISE: Recursive IntroSpEction, an approach for fine-tuning LLMs to introduce this capability, despite prior work hypothesizing that this capability may not be possible to attain. Our approach prescribes an iterative fine-tuning procedure, which attempts to teach the model how to alter its response after having executed previously unsuccessful attempts to solve a hard test-time problem, with optionally additional environment feedback. RISE poses fine-tuning for a single-turn prompt as solving a multi-turn Markov decision process (MDP), where the initial state is the prompt. Inspired by principles in online imitation learning and reinforcement learning, we propose strategies for multi-turn data collection and training so as to imbue an LLM with the capability to recursively detect and correct its previous mistakes in subsequent iterations. Our experiments show that RISE enables Llama2, Llama3, and Mistral models to improve themselves with more turns on math reasoning tasks, outperforming several single-turn strategies given an equal amount of inference-time computation. We also find that RISE scales well, often attaining larger benefits with more capable models. Our analysis shows that RISE makes meaningful improvements to responses to arrive at the correct solution for challenging prompts, without disrupting one-turn abilities as a result of expressing more complex distributions.
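
The multi-turn framing described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `State`, `rollout`, and `toy_policy`, the feedback strings, and the turn budget are all hypothetical, and the toy policy merely stands in for an LLM. The sketch only shows the shape of the idea: the initial state is the prompt, each action is a full response, and each later state also contains the earlier attempts plus (optional) feedback.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class State:
    """Hypothetical MDP state: the prompt plus earlier (attempt, feedback) pairs."""
    prompt: str
    history: List[Tuple[str, str]] = field(default_factory=list)


def rollout(
    prompt: str,
    policy: Callable[[State], str],
    is_correct: Callable[[str], bool],
    max_turns: int = 3,
) -> Tuple[State, bool]:
    """Run one multi-turn episode; stop on the first correct attempt."""
    state = State(prompt)  # the initial state is just the prompt
    for _ in range(max_turns):
        attempt = policy(state)  # action: a complete response
        if is_correct(attempt):
            state.history.append((attempt, "correct"))
            return state, True
        # Assumed feedback format; the abstract treats feedback as optional.
        state.history.append((attempt, "incorrect, please try again"))
    return state, False


def toy_policy(state: State) -> str:
    """Stand-in for an LLM: proposes a different answer on each turn."""
    candidates = ["5", "6", "7"]
    return candidates[min(len(state.history), len(candidates) - 1)]
```

For example, `rollout("What is 3 + 4?", toy_policy, lambda a: a == "7")` fails twice and succeeds on the third turn, so the final state's history holds three attempts. Under this assumed setup, trajectories of this kind are what a multi-turn data collection step would gather before fine-tuning.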

 
