Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers

  • 2025-10-21 17:19:46
  • Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, Fuli Luo

Abstract

Reinforcement learning (RL) has emerged as a crucial approach for enhancing the capabilities of large language models. However, in Mixture-of-Experts (MoE) models, the routing mechanism often introduces instability, even leading to catastrophic RL training collapse. We analyze the training-inference consistency of MoE models and identify a notable discrepancy in routing behaviors between the two phases. Moreover, even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes. To address this foundational inconsistency, we propose Rollout Routing Replay (R3), a method that records routing distributions from the inference engine and replays them during training. R3 significantly reduces training-inference policy KL divergence and mitigates extreme discrepancies without compromising training speed. Extensive experiments on various settings confirm that R3 succeeds in stabilizing RL training, preventing collapse and outperforming methods such as GSPO and TIS. We believe this work can offer a new solution for stabilizing RL in MoE models.
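
The record-and-replay idea can be illustrated with a toy top-k router. This is only a minimal sketch under assumptions that the abstract does not spell out: that what gets recorded is the set of selected experts per token, and that the training pass recomputes gate weights from its own logits over that recorded set. The helper names `topk_mask` and `gate_weights` and all numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def topk_mask(logits, k):
    """Boolean mask marking the k largest logits in each row."""
    idx = np.argsort(-logits, axis=-1, kind="stable")[:, :k]
    mask = np.zeros(logits.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def gate_weights(logits, mask):
    """Softmax over the selected experts only; unselected experts get weight 0."""
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)

# Toy router logits for 2 tokens and 4 experts (made-up numbers).
rollout_logits = np.array([[2.0, 1.0, 0.99, -1.0],
                           [0.5, 0.4, 3.0, 2.9]])
# Training-side logits differ slightly, standing in for a numerical mismatch
# between the inference engine and the training framework (assumption).
train_logits = rollout_logits + np.array([[0.0, 0.0, 0.02, 0.0],
                                          [0.0, 0.0, 0.0, 0.0]])

k = 2
rollout_mask = topk_mask(rollout_logits, k)  # recorded during rollout
train_mask = topk_mask(train_logits, k)      # what training would select on its own

# Tokens whose expert selection differs between the two passes.
mismatch = (rollout_mask != train_mask).any(axis=-1)

# Replay: reuse the recorded selection, but weight experts with training logits.
replayed = gate_weights(train_logits, rollout_mask)
```

In this toy case the first token's second expert flips between the two passes because two logits are nearly tied, while the second token is unaffected. Replaying the recorded mask makes the training pass use the same experts as the rollout for both tokens.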

 
