Abstract
Sparse Mixture-of-Experts (MoE) architectures have been widely adopted in recent large language models (LLMs) because they can efficiently scale up model capability without increasing inference cost. However, evaluations on a broad range of downstream tasks reveal that the routers in existing MoE LLMs are consistently suboptimal, leaving a severe performance gap (e.g., 10-20% in accuracy) relative to optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embeddings can effectively reduce this gap and improve the generalization performance of MoE LLMs. Our method, "Routing Manifold Alignment (RoMA)", introduces an additional manifold regularization term into the post-training objective and requires only lightweight finetuning of the routers (with all other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (samples whose routing weights lead to correct answers) in a task embedding space. Consequently, samples targeting similar tasks share similar expert choices across layers. Building such bindings between tasks and experts across different samples is essential for better generalization. Moreover, RoMA demonstrates the advantage of unifying task understanding (by embedding models) with solution generation (by MoE LLMs). In experiments, we finetune the routers in OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvement brought by RoMA.
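
The regularization described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the exact formula, so the cosine similarity, the k-nearest-neighbor selection, the squared Euclidean distance, the uniform neighbor weighting, and the names manifold_penalty, successful_neighbors, regularized_objective, and lam are all assumptions made for illustration. Routing weights are represented as one flat vector per sample, whereas a real MoE model has one such vector per layer and token.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def successful_neighbors(i, embeddings, success, k):
    """Indices of the k successful samples closest to sample i in the
    task embedding space (assumed metric: cosine similarity)."""
    candidates = [j for j in range(len(embeddings)) if j != i and success[j]]
    candidates.sort(key=lambda j: cosine(embeddings[i], embeddings[j]), reverse=True)
    return candidates[:k]


def manifold_penalty(routing, embeddings, success, k=1):
    """Mean squared distance between each sample's routing weights and
    those of its successful neighbors (assumed form of the regularizer)."""
    total, count = 0.0, 0
    for i in range(len(routing)):
        for j in successful_neighbors(i, embeddings, success, k):
            total += sum((a - b) ** 2 for a, b in zip(routing[i], routing[j]))
            count += 1
    return total / count if count else 0.0


def regularized_objective(task_loss, penalty, lam=0.1):
    """Post-training objective: task loss plus a weighted manifold term.
    The weight lam is a hypothetical hyperparameter."""
    return task_loss + lam * penalty


# Toy data: samples 0-2 belong to one task, samples 3-5 to another.
embeddings = [[1.0, 0.0], [0.95, 0.05], [0.9, 0.1],
              [0.0, 1.0], [0.05, 0.95], [0.1, 0.9]]
# Samples 2 and 5 were answered incorrectly under their current routing.
success = [True, True, False, True, True, False]

# Routing weights over two experts: failed samples either agree with
# their successful neighbors (aligned) or pick the other expert.
routing_aligned = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1],
                   [0.1, 0.9], [0.1, 0.9], [0.1, 0.9]]
routing_misaligned = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9],
                      [0.1, 0.9], [0.1, 0.9], [0.9, 0.1]]

penalty_aligned = manifold_penalty(routing_aligned, embeddings, success)
penalty_misaligned = manifold_penalty(routing_misaligned, embeddings, success)
print(penalty_aligned, penalty_misaligned)
```

In this toy setting the penalty is zero when every sample routes like its successful neighbors and positive when the failed samples route differently, so minimizing it pulls failed samples toward the expert choices that worked for similar tasks. In an actual finetuning run the penalty would be computed on differentiable router outputs and only the router parameters would receive gradients.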