Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the $Pass@K$ metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the $Pass@K$ metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, $CoT$-$Pass@K$, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using $CoT$-$Pass@K$, we observe that RLVR can incentivize the generalization of correct reasoning for all values of $K$. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.
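
To make the difference between the two metrics concrete, the following is a minimal sketch rather than the evaluation code used in this work. It assumes the commonly used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ for $Pass@K$, where $n$ is the number of sampled responses and $c$ the number counted as correct; the abstract does not specify which estimator is used. It also assumes that each sample already carries two boolean judgments (final answer correct, CoT correct); how CoT correctness is verified is outside this sketch. The function names `pass_at_k` and `both_metrics` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the chance that at least one of k samples,
    drawn without replacement from n, is among the c correct ones."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def both_metrics(samples: list[tuple[bool, bool]], k: int) -> tuple[float, float]:
    """samples holds (answer_correct, cot_correct) pairs for one prompt.

    Returns (Pass@K, CoT-Pass@K). The only difference is what counts as
    a success: CoT-Pass@K requires the answer AND the reasoning to be correct.
    """
    n = len(samples)
    c_answer = sum(1 for ans_ok, _ in samples if ans_ok)
    c_cot = sum(1 for ans_ok, cot_ok in samples if ans_ok and cot_ok)
    return pass_at_k(n, c_answer, k), pass_at_k(n, c_cot, k)


# Illustrative (made-up) judgments for 8 samples of a single prompt:
# 3 correct answers with flawed reasoning, 1 fully correct, 4 wrong.
samples = [(True, False)] * 3 + [(True, True)] + [(False, False)] * 4
p, cot_p = both_metrics(samples, k=2)
```

In this made-up example, four of eight answers are correct but only one of them has a correct CoT, so the stricter metric is lower than $Pass@K$ at the same $K$. That gap is the kind of over-crediting the abstract attributes to $Pass@K$.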