Abstract
Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. However, we observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training. To address this, we propose a simple yet effective solution that introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. Gradient-level analysis demonstrates that these objectives are compatible with the existing auxiliary loss and contribute to optimizing the training process. Experimental results over various model architectures and across multiple benchmarks show that our method significantly enhances expert specialization. Notably, our method improves classic MoE baselines with auxiliary loss by up to 23.79%, while also maintaining load balancing in downstream tasks, without any architectural modifications or additional components. We will release our code to contribute to the community.