ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning

Abstract

Recent research on the reasoning of Large Language Models (LLMs) has sought to further enhance their performance by integrating meta-thinking -- enabling models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem-solving. However, current single-agent work lacks a specialized design for acquiring meta-thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors, encouraging LLMs to think about thinking. ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent responsible for detailed execution. Through iterative reinforcement learning with aligned objectives, these agents explore and learn to collaborate, leading to improved generalization and robustness. Experimental results demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks, including competition-level mathematical benchmarks and LLM-as-a-Judge benchmarks. Comprehensive ablation studies further illustrate the evolving dynamics of each distinct agent, providing valuable insights into how the meta-thinking reasoning process enhances the reasoning capabilities of LLMs.
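
The two-level decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Policy`, `Rollout`, `rollout`, and `shared_rewards` are hypothetical, the policies are plain callables standing in for LLMs, and "aligned objectives" is interpreted here, as an assumption, to mean that both agents receive the same scalar reward for the final answer.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-in for an LLM: maps a text prompt to a text completion.
Policy = Callable[[str], str]


@dataclass
class Rollout:
    plan: str      # output of the high-level meta-thinking agent
    answer: str    # output of the low-level reasoning agent
    reward: float  # scalar score of the final answer


def rollout(
    question: str,
    meta_policy: Policy,
    reasoning_policy: Policy,
    reward_fn: Callable[[str, str], float],
) -> Rollout:
    """Run one hierarchical episode: plan first, then execute the plan."""
    # High-level agent: produce strategic oversight for the question.
    plan = meta_policy(question)
    # Low-level agent: solve the question conditioned on that plan.
    answer = reasoning_policy(f"{question}\n\nPlan: {plan}")
    # Score only the final answer (assumption: an outcome-based reward).
    return Rollout(plan=plan, answer=answer, reward=reward_fn(question, answer))


def shared_rewards(episode: Rollout) -> Dict[str, float]:
    """Assumption: aligned objectives means both agents get the same reward."""
    return {"meta": episode.reward, "reasoning": episode.reward}


# Toy stand-ins for the two agents and the reward, for illustration only.
def toy_meta(question: str) -> str:
    return "Add the two numbers directly."


def toy_reasoner(prompt: str) -> str:
    return "4"


def toy_reward(question: str, answer: str) -> float:
    return 1.0 if answer == "4" else 0.0


episode = rollout("What is 2 + 2?", toy_meta, toy_reasoner, toy_reward)
rewards = shared_rewards(episode)
```

In an actual training loop, the per-agent rewards would feed a policy-gradient update for each agent; that step is omitted here because the abstract does not specify it.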
