ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Abstract

Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlate strongly with the task competence of the base model and with training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
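
The abstract reports pass@k results without defining the metric. As a minimal sketch, the snippet below computes pass@k with the widely used unbiased combinatorial estimator: given n sampled answers of which c are correct, it estimates the probability that at least one of k draws is correct. This is an assumption made for illustration; the abstract does not say which estimator the authors use, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumed estimator (not confirmed by the abstract):
        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this estimator, a task on which the base model never produces a correct sample (c = 0) scores 0 at every k, which corresponds to the case the abstract describes as failing entirely regardless of the number of attempts.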
