Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

  • 2025-06-17 08:06:56
  • Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye, Zhirong Wu, Xiao Liang, Yang Wang, Junjie Li, Ziming Miao, Jiang Bian, Mao Yang

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the $Pass@K$ metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the $Pass@K$ metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, $CoT$-$Pass@K$, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using $CoT$-$Pass@K$, we observe that RLVR can incentivize the generalization of correct reasoning for all values of $K$. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.
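
To make the difference between the two metrics concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ for $Pass@K$ computed from $n$ samples with $c$ successes; the abstract does not state which estimator the paper uses. The function names and the per-sample labels `answer_ok` and `cot_ok` are hypothetical, and how a chain of thought is judged correct is outside the scope of this sketch.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k draws (without
    replacement) from n samples is a success, given c successes.
    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k draws include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def answer_pass_at_k(samples: list[tuple[bool, bool]], k: int) -> float:
    """Pass@K: a sample counts as a success if its final answer is correct.
    Each sample is a hypothetical (answer_ok, cot_ok) pair."""
    c = sum(1 for answer_ok, _ in samples if answer_ok)
    return pass_at_k(len(samples), c, k)


def cot_pass_at_k(samples: list[tuple[bool, bool]], k: int) -> float:
    """CoT-Pass@K: a sample counts as a success only if both the final
    answer and the chain of thought are correct."""
    c = sum(1 for answer_ok, cot_ok in samples if answer_ok and cot_ok)
    return pass_at_k(len(samples), c, k)


# Illustrative labels only (not data from the paper): four samples for one
# problem, two with a correct answer, one of which also reasons correctly.
samples = [(True, True), (True, False), (False, False), (False, False)]
print(answer_pass_at_k(samples, 1), cot_pass_at_k(samples, 1))
```

Because every sample that succeeds under $CoT$-$Pass@K$ also succeeds under $Pass@K$, the stricter metric can never exceed the looser one for the same samples; the gap between them reflects correct answers reached through flawed reasoning.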

 
