Abstract
Mixture of Experts (MoE) architectures have significantly increased computational efficiency in both research and real-world applications of large-scale machine learning models. However, their scalability and efficiency under memory constraints remain relatively underexplored. In this work, we present joint scaling laws for dense and MoE models, incorporating key factors such as the number of active parameters, dataset size, and the number of experts. Our findings provide a principled framework for selecting the optimal MoE configuration under fixed memory and compute budgets. Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom. To derive and validate the theoretical predictions of our scaling laws, we conduct over 280 experiments with up to 2.7B active parameters and up to 5B total parameters. These results offer actionable insights for designing and deploying MoE models in practical large-scale training scenarios.