Abstract
Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of frontier models, such as DeepSeek-R1, including data curation strategies and RL training recipes, are often omitted. Moreover, recent research indicates that distillation remains more effective than RL for smaller models. In this work, we demonstrate that large-scale RL can significantly enhance the reasoning capabilities of strong, small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts. Notably, we find that math-only RL significantly enhances the performance of strong distilled models not only on math benchmarks (e.g., +14.6% / +17.2% on AIME 2025 for the 7B / 14B models), but also on code reasoning tasks (e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition, extended code-only RL iterations further improve performance on code benchmarks with minimal or no degradation in math results. We develop a robust data curation pipeline to collect challenging prompts with high-quality, verifiable answers and test cases to enable verification-based RL across both domains. Finally, we identify key experimental insights, including curriculum learning with progressively increasing response lengths and the stabilizing effect of on-policy parameter updates. We find that RL not only elicits the foundational reasoning capabilities acquired during pretraining and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.