Mixture of A Million Experts

Abstract

The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.
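
To make the product key idea concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the query is split into two halves, each half is scored against its own set of n sub-keys, and the n*n expert keys are the implicit Cartesian product of the two sets, so the top-k experts can be found from only 2n dot products plus a k-by-k candidate grid. The function names (`product_key_topk`, `sparse_expert_layer`), the softmax gating, and the choice of a single hidden neuron with ReLU per expert are illustrative assumptions; the abstract only states that the experts are tiny.

```python
import numpy as np


def product_key_topk(query, subkeys_a, subkeys_b, k):
    """Return ids and scores of the top-k keys in the n*n product key set.

    query:      (d,)       query vector, split into two halves
    subkeys_a:  (n, d//2)  sub-keys scored against the first half
    subkeys_b:  (n, d//2)  sub-keys scored against the second half
    The score of product key (i, j) is scores_a[i] + scores_b[j].
    """
    half = query.shape[0] // 2
    scores_a = subkeys_a @ query[:half]   # (n,)
    scores_b = subkeys_b @ query[half:]   # (n,)

    # The global top-k pairs must lie in top-k(a) x top-k(b).
    top_a = np.argsort(-scores_a)[:k]
    top_b = np.argsort(-scores_b)[:k]
    cand = scores_a[top_a][:, None] + scores_b[top_b][None, :]  # (k, k)

    # Pick the best k of the k*k candidates.
    flat = np.argsort(-cand.ravel())[:k]
    ia, ib = np.unravel_index(flat, cand.shape)
    n = subkeys_b.shape[0]
    expert_ids = top_a[ia] * n + top_b[ib]  # flat index into n*n experts
    return expert_ids, cand[ia, ib]


def sparse_expert_layer(x, w_query, subkeys_a, subkeys_b, down, up, k):
    """Hypothetical layer: retrieve k tiny experts and mix their outputs.

    Assumption: expert i is a one-neuron MLP, relu(down[i] . x) * up[i].
    down, up: (n*n, d_model)
    """
    query = w_query @ x
    ids, scores = product_key_topk(query, subkeys_a, subkeys_b, k)
    gates = np.exp(scores - scores.max())
    gates /= gates.sum()                       # softmax over retrieved experts
    hidden = np.maximum(down[ids] @ x, 0.0)    # (k,) one activation per expert
    return (gates * hidden) @ up[ids]          # (d_model,)


rng = np.random.default_rng(0)
n, d_key, d_model, k = 32, 16, 8, 4            # 32 * 32 = 1024 experts
subkeys_a = rng.normal(size=(n, d_key // 2))
subkeys_b = rng.normal(size=(n, d_key // 2))
w_query = rng.normal(size=(d_key, d_model))
down = rng.normal(size=(n * n, d_model))
up = rng.normal(size=(n * n, d_model))
x = rng.normal(size=d_model)

y = sparse_expert_layer(x, w_query, subkeys_a, subkeys_b, down, up, k)
```

In this sketch the per-token cost depends on n and k rather than on the total number of experts n*n, which illustrates how model size can be decoupled from compute. Setting n = 1024 would give over a million experts while still scoring only 2048 sub-keys per query.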
