Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly in mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed the corresponding base models' capacity. In this study, however, we critically re-examine this assumption by measuring the pass@k metric with large values of k to explore the reasoning capability boundary of the models across a wide range of model families and benchmarks. Surprisingly, RL does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at smaller values of k (e.g., k=1), base models can achieve a comparable or even higher pass@k score than their RL counterparts at large values of k. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already present in the base models. Further analysis shows that RL training boosts performance by biasing the model's output distribution toward paths that are more likely to yield rewards, and therefore samples correct responses more efficiently. But this also results in a narrower reasoning capability boundary compared to base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that distillation, unlike RLVR, can genuinely introduce new knowledge into the model. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities, which requires us to fundamentally rethink the impact of RL training in reasoning LLMs and the need for a better paradigm.

Project Page: https://limit-of-RLVR.github.io
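
The abstract does not say how pass@k is computed. As a minimal sketch, the code below assumes the commonly used unbiased estimator: draw n samples per problem, count the c correct ones, and estimate pass@k as 1 - C(n-c, k) / C(n, k). The function names and the per-problem counts are hypothetical and chosen only for illustration; they are not results from the paper. The toy numbers show how a model that concentrates on a few problems can lead at k=1 yet trail a broader model at large k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples that were correct, k: sampling budget (k <= n).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate becomes exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of n = 256 on two problems.
N_SAMPLES = 256
sharp_model = [200, 0]  # very reliable on one problem, never solves the other
broad_model = [40, 4]   # less reliable per sample, but solves both sometimes

for k in (1, 16, 256):
    sharp = mean_pass_at_k(sharp_model, N_SAMPLES, k)
    broad = mean_pass_at_k(broad_model, N_SAMPLES, k)
    print(f"k={k:>3}  sharp={sharp:.3f}  broad={broad:.3f}")
```

With these made-up counts, the sharp model is ahead at k=1, while at k=256 it can never exceed 0.5 because one problem is never solved, and the broad model reaches 1.0.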
