MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators

Abstract

A critical approach for efficiently deploying Mixture-of-Experts (MoE) models with massive parameters is quantization. However, state-of-the-art MoE models suffer from non-negligible accuracy loss with extreme quantization, such as under 4 bits. To address this, we introduce MiLo, a novel method that augments highly quantized MoEs with a mixture of low-rank compensators. These compensators consume only a small amount of additional memory but significantly recover accuracy loss from extreme quantization. MiLo also identifies that MoE models exhibit distinctive characteristics across weights due to their hybrid dense-sparse architectures, and employs adaptive rank selection policies along with iterative optimizations to close the accuracy gap. MiLo does not rely on calibration data, allowing it to generalize to different MoE models and datasets without overfitting to a calibration set. To avoid the hardware inefficiencies of extreme quantization, such as 3-bit, MiLo develops Tensor Core-friendly 3-bit kernels, enabling measured latency speedups on 3-bit quantized MoE models. Our evaluation shows that MiLo outperforms existing methods on SoTA MoE models across various tasks.
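
The idea of a low-rank compensator can be illustrated with a minimal NumPy sketch. This is not MiLo's actual algorithm: it assumes a simple per-row uniform 3-bit quantizer and a single truncated SVD of the quantization residual, whereas the abstract describes adaptive rank selection and iterative optimization. The function names, matrix sizes, and the rank of 8 are hypothetical choices made for illustration only.

```python
import numpy as np


def quantize_uniform(w, bits=3):
    """Per-row asymmetric uniform quantization (an assumed, generic scheme).

    Returns the dequantized weights so the error can be measured directly.
    """
    levels = 2 ** bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    q = np.clip(np.round((w - w_min) / scale), 0, levels)
    return q * scale + w_min


def low_rank_compensator(residual, rank):
    """Best rank-`rank` approximation of the residual via truncated SVD."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (rows, rank)
    b = vt[:rank, :]            # shape (rank, cols)
    return a, b


# Hypothetical weight matrix standing in for one expert's projection.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128))

w_q = quantize_uniform(w, bits=3)
a, b = low_rank_compensator(w - w_q, rank=8)

# Frobenius-norm error without and with the compensator.
err_quant = np.linalg.norm(w - w_q)
err_comp = np.linalg.norm(w - (w_q + a @ b))
print(err_comp < err_quant)
```

In this sketch the compensator stores 64 × 8 + 8 × 128 extra values next to the 64 × 128 quantized matrix. The truncated SVD removes the largest singular directions of the residual, so the compensated error is never larger than the uncompensated one.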
