Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs

Abstract

We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing previously undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization (C3PO), a novel approach that enhances training stability and improves computational throughput via an algorithm-system co-design methodology. Second, we empirically demonstrate that selecting distillation checkpoints for RL by entropy loss, rather than by validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing the domain conflicts that arise when training on mixed datasets. We will release the model, dataset, and code.
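As a rough illustration of the parameter figures quoted above, the short Python sketch below works out the fraction of Ring-lite's parameters that are active per token, and the activated-parameter count that the "one-third" claim would imply for a comparable model. The function names are hypothetical, and the implied size is only an inference from the stated ratio, not a number reported in the abstract.

```python
# Figures taken from the abstract (in billions of parameters).
TOTAL_PARAMS_B = 16.8       # total parameters of Ling-lite / Ring-lite
ACTIVATED_PARAMS_B = 2.75   # parameters activated per token


def activation_ratio(activated_b: float, total_b: float) -> float:
    """Fraction of the model's parameters that are active per token."""
    return activated_b / total_b


def implied_comparable_size(activated_b: float, fraction: float) -> float:
    """Activated-parameter count of a comparable model, ASSUMING Ring-lite
    activates exactly `fraction` of what that model activates."""
    return activated_b / fraction


ratio = activation_ratio(ACTIVATED_PARAMS_B, TOTAL_PARAMS_B)
# "One-third" is read literally here; the abstract gives no exact figure.
comparable_b = implied_comparable_size(ACTIVATED_PARAMS_B, 1 / 3)

print(f"Activated fraction: {ratio:.1%}")
print(f"Implied comparable model: ~{comparable_b:.2f}B activated parameters")
```

Under these assumptions, roughly 16% of the model's parameters are active per token, and the "one-third" comparison corresponds to models that activate on the order of 8B parameters.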
