Consistency Is Not Always Correct: Towards Understanding the Role of Exploration in Post-Training Reasoning

  • 2025-11-10 18:25:26
  • Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Bo Xue, Qingfu Zhang, Hau-San Wong, Taiji Suzuki

Abstract

Foundation models exhibit broad knowledge but limited task-specific reasoning, motivating post-training strategies such as RLVR and inference scaling with outcome or process reward models (ORM/PRM). While recent work highlights the role of exploration and entropy stability in improving pass@K, empirical evidence points to a paradox: RLVR and ORM/PRM typically reinforce existing tree-like reasoning paths rather than expanding the reasoning scope, raising the question of why exploration helps at all if no new patterns emerge. To reconcile this paradox, we adopt the perspective of Kim et al. (2025), viewing easy (e.g., simplifying a fraction) versus hard (e.g., discovering a symmetry) reasoning steps as low- versus high-probability Markov transitions, and formalize post-training dynamics through Multi-task Tree-structured Markov Chains (TMC). In this tractable model, pretraining corresponds to tree expansion, while post-training corresponds to chain-of-thought reweighting. We show that several phenomena recently observed in empirical studies arise naturally in this setting: (1) RLVR induces a squeezing effect, reducing reasoning entropy and forgetting some correct paths; (2) population rewards of ORM/PRM encourage consistency rather than accuracy, thereby favoring common patterns; and (3) certain rare, high-uncertainty reasoning paths by the base model are responsible for solving hard problem instances. Together, these explain why exploration, even when confined to the base model's reasoning scope, remains essential: it preserves access to rare but crucial reasoning traces needed for difficult cases, which are squeezed out by RLVR or unfavored by inference scaling. Building on this, we further show that exploration strategies such as rejecting easy instances and KL regularization help preserve rare reasoning traces. Empirical simulations corroborate our theoretical results.
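
The squeezing effect and the two exploration strategies named above can be illustrated with a toy calculation. The sketch below is not the paper's TMC model. It assumes a hypothetical base model that spreads probability over four fixed reasoning paths, an instance-agnostic policy, and a hypothetical mix of 80% easy and 20% hard instances in which only the rarest path solves the hard ones. Post-training is stood in for by exponential reweighting, p_i ∝ base_i · exp(strength · reward_i); with strength = 1/beta this is the standard closed-form maximizer of expected reward minus beta times the KL divergence to the base model. All names and numbers are illustrative assumptions.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy (in nats) of a distribution over reasoning paths."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def reweight(base, rewards, strength):
    """Exponential reweighting: p_i proportional to base_i * exp(strength * reward_i)."""
    return normalize([p * math.exp(strength * r) for p, r in zip(base, rewards)])

def pass_at_k(p_success, k):
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p_success) ** k

# Hypothetical base model: two common paths, one useless path, one rare path.
base = [0.60, 0.30, 0.08, 0.02]
RARE = 3  # index of the only path assumed to solve hard instances

# Assumed instance mix: 80% easy (solved by paths 0 and 1), 20% hard (path 3 only).
mixed_rewards = [0.8, 0.8, 0.0, 0.2]
# "Rejecting easy instances": reward is measured on hard instances only.
hard_only_rewards = [0.0, 0.0, 0.0, 1.0]

squeezed = reweight(base, mixed_rewards, strength=10.0)    # weak regularization
kl_reg = reweight(base, mixed_rewards, strength=1.0)       # KL-regularized, beta = 1
hard_only = reweight(base, hard_only_rewards, strength=1.0)

for name, dist in [("base", base), ("squeezed", squeezed),
                   ("KL-regularized", kl_reg), ("hard-only", hard_only)]:
    print(f"{name:15s} entropy={entropy(dist):.3f} "
          f"rare-path mass={dist[RARE]:.2e} "
          f"pass@64 on hard={pass_at_k(dist[RARE], 64):.3f}")
```

Under these assumptions, strong reweighting on the mixed reward lowers entropy and drives the rare path's mass far below its base value, so pass@64 on hard instances collapses. The KL-regularized variant retains more of that mass, and training on hard instances only increases it.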
