Learning from Peers in Reasoning Models

Abstract

Large Reasoning Models (LRMs) can self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, every T tokens, each reasoning path summarizes its intermediate reasoning and shares it with others through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP scores nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis reveals that LeaP corrects errors robustly through timely peer insights, showing strong error tolerance and handling varied task difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/ .
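
The summarize-and-share loop described in the abstract can be illustrated with a toy sketch. This is not the released implementation: the names `ReasoningPath`, `share_with_peers`, `run_leap_sketch`, `first_k_router`, and the `top_k` parameter are hypothetical, the token generator and summarizer are stand-ins for model calls, and the router is a placeholder rather than the routing mechanism used in the paper. It only shows the control flow: every `interval_t` tokens, each path summarizes its state and receives summaries from other paths.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ReasoningPath:
    """One parallel reasoning path (hypothetical container for this sketch)."""
    path_id: int
    tokens: List[str] = field(default_factory=list)
    # Peer summaries received so far. In a real system these would be
    # appended to the path's context so later tokens can condition on them.
    peer_notes: List[str] = field(default_factory=list)


def first_k_router(path_id: int, peer_ids: List[int], top_k: int) -> List[int]:
    """Placeholder router: pick the first top_k peers (an assumption)."""
    return peer_ids[:top_k]


def share_with_peers(paths, summarize, route, top_k):
    """One sharing round: every path summarizes, then receives peer summaries."""
    # Summaries are computed for all paths before any path receives notes.
    summaries = {p.path_id: summarize(p) for p in paths}
    for p in paths:
        peer_ids = [pid for pid in summaries if pid != p.path_id]
        for pid in route(p.path_id, peer_ids, top_k):
            p.peer_notes.append(summaries[pid])


def run_leap_sketch(
    num_paths: int,
    interval_t: int,
    total_tokens: int,
    top_k: int,
    generate_token: Callable[[ReasoningPath], str],
    summarize: Callable[[ReasoningPath], str],
    route: Callable[[int, List[int], int], List[int]] = first_k_router,
) -> List[ReasoningPath]:
    """Generate tokens in lockstep and trigger a sharing round every interval_t tokens."""
    paths = [ReasoningPath(i) for i in range(num_paths)]
    for step in range(1, total_tokens + 1):
        for p in paths:
            p.tokens.append(generate_token(p))
        if step % interval_t == 0:
            share_with_peers(paths, summarize, route, top_k)
    return paths


# Stand-ins for model calls, used only to make the sketch runnable.
def toy_generate(p: ReasoningPath) -> str:
    return f"tok{len(p.tokens)}"


def toy_summarize(p: ReasoningPath) -> str:
    return f"path {p.path_id} after {len(p.tokens)} tokens"


if __name__ == "__main__":
    for path in run_leap_sketch(3, 4, 8, 2, toy_generate, toy_summarize):
        print(path.path_id, path.peer_notes)
```

In this sketch a path never receives its own summary, and the number of sharing rounds is `total_tokens // interval_t`.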
