Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the $Pass@K$ metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the $Pass@K$ metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, $CoT$-$Pass@K$, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using $CoT$-$Pass@K$, we observe that RLVR can incentivize the generalization of correct reasoning for all values of $K$. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.
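
To make the difference between the two metrics concrete, the sketch below contrasts $Pass@K$ with $CoT$-$Pass@K$ on a single prompt. It is an illustration only, not the paper's evaluation code. It assumes the widely used unbiased estimator $1 - \binom{n-c}{K} / \binom{n}{K}$ for $n$ sampled responses of which $c$ count as successes, and it assumes that $CoT$-$Pass@K$ differs only in how $c$ is counted: a response is a success only if both its chain of thought and its final answer are judged correct. The function names and the `(answer_correct, cot_correct)` sample format are hypothetical, and how a chain of thought is judged correct is left outside the sketch.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k draws, taken without
    replacement from n sampled responses, is among the c successes."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def answer_pass_at_k(samples: list[tuple[bool, bool]], k: int) -> float:
    """Standard Pass@K: a sample succeeds if its final answer is correct."""
    c = sum(1 for answer_ok, _ in samples if answer_ok)
    return pass_at_k(len(samples), c, k)


def cot_pass_at_k(samples: list[tuple[bool, bool]], k: int) -> float:
    """Stricter variant (assumed form): a sample succeeds only if both
    the final answer and the chain of thought are correct."""
    c = sum(1 for answer_ok, cot_ok in samples if answer_ok and cot_ok)
    return pass_at_k(len(samples), c, k)


# Hypothetical judgments for four responses to one prompt:
# (final answer correct?, chain of thought correct?)
samples = [(True, True), (True, False), (False, False), (True, True)]

print(answer_pass_at_k(samples, k=1))  # counts the lucky second response
print(cot_pass_at_k(samples, k=1))     # ignores it, so the value is lower
```

Because every response that passes the stricter check also has a correct final answer, the $CoT$-$Pass@K$ value computed this way can never exceed the $Pass@K$ value for the same samples.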
