Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

Abstract

This paper introduces Light-R1, an open-source suite for training long reasoning models using a reproducible and cost-effective methodology. Given the proprietary nature of data used in the DeepSeek-R1 series, we develop an alternative approach leveraging exclusively public data and models. Our curriculum training progressively increases data difficulty, combined with multi-staged post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by the DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS, performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying GRPO on long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 and AIME25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization. Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data and code have been made available at https://github.com/Qihoo360/Light-R1.
