Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning

Abstract

Reinforcement learning (RL) has become the dominant paradigm for endowing language models with advanced reasoning capabilities. Despite the substantial empirical gains demonstrated by RL-based training methods like GRPO, a granular understanding of their advantages is still lacking. To address this gap, we introduce a fine-grained analytic framework to dissect the impact of RL on reasoning. Our framework specifically investigates key elements that have been hypothesized to benefit from RL training: (1) plan-following and execution, (2) problem decomposition, and (3) improved reasoning and knowledge utilization. Using this framework, we gain insights beyond mere accuracy. For instance, providing models with explicit step-by-step plans surprisingly degrades performance on the most challenging benchmarks, yet RL-tuned models exhibit greater robustness, experiencing markedly smaller performance drops than their base counterparts. This suggests that RL may not primarily enhance the execution of external plans but rather empower models to formulate and follow internal strategies better suited to their reasoning processes. Conversely, we observe that RL enhances the model's capacity to integrate provided knowledge into its reasoning process, leading to performance improvements across diverse tasks. We also study difficulty, showing improved training by developing new ways to exploit hard problems. Our findings lay a foundation for more principled training and evaluation of reasoning models.
