Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model

  • 2025-10-21 17:46:14
  • Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, Chengyao Wen, Congqi Li, Deng Zhao, Dingbo Yuan, Donghai You, Fagui Mao, Fanzhuang Meng, Feng Xu, Guojie Li, Guowei Wang, Hao Dai, Haonan Zheng, Hong Liu, Jia Guo, Jiaming Liu, Jian Liu, Jianhao Fu, Jiannan Shi, Jianwen Wang, Jianxin Lai, Jin Yang, Jun Mei, Jun Zhou, Junbo Zhao, Junping Zhao, Kuan Xu, Le Su, Lei Chen, Li Tang, Liang Jiang, Liangcheng Fu, Lianhao Xu, Linfeng Shi, Lisha Liao, Longfei Zheng, Meng Li, Mingchun Chen, Qi Zuo, Qiang Cheng, Qianggang Cao, Qitao Shi, Quanrui Guo, Senlin Zhu, Shaofei Wang, Shaomian Zheng, Shuaicheng Li, Shuwei Gu, Siba Chen, Tao Wu, Tao Zhang, Tianyu Zhang, Tianyu Zhou, Tiwei Bie, Tongkai Yang, Wang Hong, Wang Ren, Weihua Chen, Wenbo Yu, Wen

Abstract

We present Ring-1T, the first open-source, state-of-the-art thinking model with trillion-scale parameters. It has 1 trillion total parameters and activates approximately 50 billion per token. Training models at the trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, improving time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T-parameter MoE model, we give the research community direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.
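To make the idea of token-level discrepancy masking more concrete, the following is a minimal sketch and not the authors' IcePop implementation. It assumes the discrepancy is measured as the per-token probability ratio between the training engine and the inference engine, and that tokens whose ratio falls outside a band are dropped from the loss. The function names and the `low` and `high` thresholds are hypothetical, and the clipping component mentioned in the abstract is not shown.

```python
import math


def discrepancy_mask(train_logprobs, infer_logprobs, low=0.5, high=2.0):
    """Return a per-token mask: 1.0 keeps the token, 0.0 drops it.

    Assumption (not from the abstract): the discrepancy is the ratio
    p_train / p_infer, and `low` / `high` are illustrative thresholds.
    """
    if len(train_logprobs) != len(infer_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    mask = []
    for lp_train, lp_infer in zip(train_logprobs, infer_logprobs):
        ratio = math.exp(lp_train - lp_infer)  # p_train / p_infer
        mask.append(1.0 if low <= ratio <= high else 0.0)
    return mask


def masked_mean(values, mask):
    """Average `values` over the tokens that the mask keeps."""
    kept = sum(mask)
    if kept == 0:
        return 0.0  # every token was masked out; contribute nothing
    return sum(v * m for v, m in zip(values, mask)) / kept


# Toy example: the second token has a large train/inference mismatch.
train_lp = [math.log(0.5), math.log(0.2), math.log(0.9)]
infer_lp = [math.log(0.5), math.log(0.8), math.log(0.85)]
token_losses = [1.0, 5.0, 3.0]

mask = discrepancy_mask(train_lp, infer_lp)
loss = masked_mean(token_losses, mask)
```

In this toy case the mismatched token is excluded, so only the two well-aligned tokens contribute to the averaged loss.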

 
