Exploring Sparse Expert Models and Beyond

Abstract

Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters at constant computation cost, and they have therefore become a trend in model scaling. It remains unclear, however, how MoE layers bring quality gains by leveraging parameters with sparse activation. In this work, we investigate several key factors in sparse expert models. We observe that load imbalance may not be a significant problem affecting model quality, contrary to the perspectives of recent studies, while the number of sparsely activated experts $k$ and the expert capacity $C$ in top-$k$ routing can make a significant difference. Furthermore, we propose a simple method called expert prototyping, which splits experts into different prototypes and applies $k$ top-$1$ routing. This strategy improves model quality while keeping computational cost constant, and our further exploration on extremely large-scale models suggests that it is more effective when training larger models. We push the model scale to over $1$ trillion parameters and implement it on only $480$ NVIDIA V100-32GB GPUs, in comparison with the recent SOTA Switch Transformer on $2048$ TPUs. The proposed giant model achieves substantial speedup in convergence over the same-size baseline.
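
To make the idea of $k$ top-$1$ routing concrete, the following is a minimal sketch for a single token, not the authors' implementation. It assumes that the experts are split into $k$ equal-sized contiguous groups (the prototypes), that each group applies its own softmax over its router logits, and that the highest-scoring expert in each group is selected. The function name `prototype_top1_route`, the grouping scheme, and the example logits are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def prototype_top1_route(logits, k):
    """Pick one expert per prototype for a single token.

    logits: router scores, one per expert (hypothetical layout: experts
            are grouped contiguously, prototype g owns a slice of size n/k).
    k:      number of prototypes, i.e. number of experts activated.
    Returns a list of (global_expert_index, gate_weight) pairs.
    """
    n = len(logits)
    if k <= 0 or n % k != 0:
        raise ValueError("number of experts must be divisible by k")
    group_size = n // k
    choices = []
    for g in range(k):
        start = g * group_size
        # Assumption: each prototype normalizes only over its own experts.
        probs = softmax(logits[start:start + group_size])
        local = max(range(group_size), key=probs.__getitem__)
        choices.append((start + local, probs[local]))
    return choices


# Eight experts split into two prototypes of four experts each.
choices = prototype_top1_route([0.1, 2.0, -1.0, 0.5, 0.3, 3.0, 0.0, 1.0], k=2)
selected = [expert for expert, _ in choices]
```

Under these assumptions each token still activates exactly $k$ experts, as in top-$k$ routing, but every selection is an independent top-$1$ decision within a smaller group of experts.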
