Abstract
Reinforcement learning (RL) has demonstrated remarkable success in enhancing model capabilities, including instruction-following, preference learning, and reasoning. Yet despite its empirical successes, the mechanisms by which RL improves reasoning abilities remain poorly understood. We present a systematic study of Reinforcement Learning with Verifiable Rewards (RLVR), showing that its primary benefit comes from optimizing the selection of existing reasoning patterns. Through extensive experiments, we demonstrate that RLVR-trained models preferentially adopt high-success-rate reasoning patterns while mostly maintaining stable performance on individual patterns. We further develop theoretical analyses of the convergence and training dynamics of RLVR based on a simplified question-reason-answer model. We study the gradient flow and show that RLVR can indeed find the solution that selects the reasoning pattern with the highest success rate. In addition, our theoretical results reveal two distinct regimes in the convergence of RLVR training: (1) rapid convergence for models with relatively strong initial reasoning capabilities and (2) slower optimization dynamics for weaker models. Furthermore, we show that the slower optimization for weaker models can be mitigated by applying supervised fine-tuning (SFT) before RLVR, provided the SFT dataset is of reasonably high quality. We validate the theoretical findings through extensive experiments. This work advances our theoretical understanding of RL's role in LLM fine-tuning and offers insights for further enhancing reasoning capabilities.
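
To make the pattern-selection claim and the two convergence regimes concrete, the following is a minimal illustrative sketch, not the question-reason-answer model analyzed in this work. It assumes a toy policy that is a softmax over two reasoning patterns with fixed success rates, and it runs exact gradient ascent on the expected verifiable reward. The function names, success rates, learning rate, and threshold are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steps_to_select_best(success_rates, init_logits, lr=0.5,
                         threshold=0.9, max_steps=100_000):
    """Gradient ascent on J = sum_k pi_k * p_k for a softmax policy.

    Toy assumption: each reasoning pattern k has a fixed success rate p_k,
    and the policy only chooses which pattern to use.
    Returns (number of steps taken, final policy), stopping once the
    best pattern's probability reaches `threshold`.
    """
    logits = list(init_logits)
    best = max(range(len(success_rates)), key=lambda k: success_rates[k])
    for step in range(max_steps):
        pi = softmax(logits)
        if pi[best] >= threshold:
            return step, pi
        expected = sum(q * p for q, p in zip(pi, success_rates))
        # Exact gradient of J w.r.t. logit k is pi_k * (p_k - J).
        logits = [t + lr * q * (p - expected)
                  for t, q, p in zip(logits, pi, success_rates)]
    return max_steps, softmax(logits)


# Hypothetical success rates: pattern 0 is the better reasoning pattern.
success_rates = [0.9, 0.3]

# "Strong" start: the best pattern already has probability 0.5.
strong_steps, strong_pi = steps_to_select_best(success_rates, [0.0, 0.0])

# "Weak" start: the best pattern is initially chosen very rarely.
weak_steps, weak_pi = steps_to_select_best(success_rates, [-6.0, 0.0])

print("strong start:", strong_steps, "steps")
print("weak start:  ", weak_steps, "steps")
```

In this toy setting both runs end up selecting the higher-success pattern, but the weak start needs noticeably more steps because the gradient on the best pattern's logit scales with its current probability. Raising that initial probability before the RL phase, which is one way to read the role of SFT here, moves the weak run toward the fast regime.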