Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs

  • 2025-06-17 18:12:34
  • Ring Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, Liang Jiang, Liangcheng Fu, Longfei Zheng, Qiang Gao, Qing Cui, Quan Wan, Shaomian Zheng, Shuaicheng Li, Tongkai Yang, Wang Ren, Xiaodong Yan, Xiaopei Wan, Xiaoyun Feng, Xin Zhao, Xinxing Yang, Xinyu Kong, Xuemin Yang, Yang Li, Yingting Wu, Yongkang Liu, Zhankai Xu, Zhenduo Zhang, Zhenglei Zhou, Zhenyu Huang, Zhiqiang Zhang, Zihao Wang, Zujie Wen

Abstract

We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization (C3PO), a novel approach that enhances training stability and improves computational throughput via an algorithm-system co-design methodology. Second, we empirically demonstrate that selecting distillation checkpoints for RL training based on entropy loss, rather than validation metrics, yields superior performance-efficiency trade-offs in the subsequent RL stage. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing the domain conflicts that arise when training on mixed datasets. We will release the model, dataset, and code.

 
