Abstract
DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state of the art. Our code is available at https://github.com/sail-sg/understand-r1-zero.