Abstract
Recent advancements in reasoning-focused language models such as OpenAI's O1 and DeepSeek-R1 have shown that scaling test-time computation, through chain-of-thought reasoning and iterative exploration, can yield substantial improvements on complex tasks like mathematics and code generation. These breakthroughs have been driven by large-scale reinforcement learning (RL), particularly when combined with verifiable reward signals that provide objective and grounded supervision. In this report, we investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. Our work identifies several key ingredients for effective training, including the use of verifiable reward tasks, enhancements to Group Relative Policy Optimization (GRPO), and practical techniques to improve training stability and generalization. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks. To facilitate continued research, we release our model publicly.