GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Abstract

Scaling language models with more data, compute and parameters has drivensignificant progress in natural language processing. For example, thanks toscaling, GPT-3 was able to achieve strong results on in-context learning tasks.However, training these large dense models requires significant amounts ofcomputing resources. In this paper, we propose and develop a family of languagemodels named GLaM (Generalist Language Model), which uses a sparsely activatedmixture-of-experts architecture to scale the model capacity while alsoincurring substantially less training cost compared to dense variants. Thelargest GLaM has 1.2 trillion parameters, which is approximately 7x larger thanGPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires halfof the computation flops for inference, while still achieving better overallzero-shot and one-shot performance across 29 NLP tasks.

Quick Read (beta)

loading the full paper ...