Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts

Abstract

The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation to balance performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where underloaded experts complete computations early but must wait for overloaded experts, leading to global delays. We define this phenomenon as the Straggler Effect, as the most burdened experts dictate the overall inference latency. To address this, we first propose Capacity-Aware Token Drop, which enforces expert capacity limits by discarding excess tokens from overloaded experts, effectively reducing load imbalance with minimal performance impact (e.g., 30% speedup with only 0.9% degradation on OLMoE). Next, given the presence of low-load experts remaining well below the capacity threshold, we introduce Capacity-Aware Expanded Drop, which allows tokens to include additional local experts in their candidate set before enforcing strict local capacity constraints, thereby improving load balance and enhancing the utilization of underused experts. Extensive experiments on both language and multimodal MoE models demonstrate the effectiveness of our approach, yielding substantial gains in expert utilization, model performance, and inference efficiency, e.g., applying Expanded Drop to Mixtral-8×7B-Instruct yields a 0.2% average performance improvement and a 1.85× inference speedup.
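
The Token Drop idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that capacity is a multiple (`capacity_factor`) of the perfectly balanced per-expert load and that an overloaded expert keeps the tokens with the highest router scores. The function names, the capacity formula, and the drop criterion are all hypothetical choices made for illustration.

```python
import math


def top_k_experts(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def capacity_aware_token_drop(router_scores, k, capacity_factor):
    """Route tokens with top-k, then cap every expert's load.

    router_scores: list of per-token score lists (num_tokens x num_experts).
    Returns (kept, loads_before, capacity), where kept maps each expert to
    the sorted token indices it still processes after dropping.
    """
    num_tokens = len(router_scores)
    num_experts = len(router_scores[0])

    # Assumption: capacity = capacity_factor * (balanced load per expert).
    capacity = math.ceil(capacity_factor * num_tokens * k / num_experts)

    # Step 1: ordinary top-k routing, recording (score, token) per expert.
    assigned = {e: [] for e in range(num_experts)}
    for t, scores in enumerate(router_scores):
        for e in top_k_experts(scores, k):
            assigned[e].append((scores[e], t))

    loads_before = [len(assigned[e]) for e in range(num_experts)]

    # Step 2: an overloaded expert keeps only its `capacity` best tokens
    # (assumed criterion: highest router score); the excess is dropped.
    kept = {}
    for e, pairs in assigned.items():
        best = sorted(pairs, key=lambda p: p[0], reverse=True)[:capacity]
        kept[e] = sorted(t for _, t in best)

    return kept, loads_before, capacity


# Toy example: 4 tokens, 4 experts, top-1 routing, capacity factor 2.
router_scores = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.2, 0.6, 0.1],
]
kept, loads_before, capacity = capacity_aware_token_drop(
    router_scores, k=1, capacity_factor=2.0
)
loads_after = [len(kept[e]) for e in range(4)]
```

In this toy run, expert 0 initially receives three of the four tokens while experts 1 and 3 receive none. With a capacity of 2, the lowest-scoring token on expert 0 is dropped, so the maximum per-expert load, which determines latency under expert parallelism, falls from 3 to 2.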
