DAPO: An Open-Source LLM Reinforcement Learning System at Scale

  • 2025-03-18 18:49:06
  • Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, Mingxuan Wang

Abstract

Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.

 
