GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

  • 2022-08-01 22:07:58
  • Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui

Abstract

Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation FLOPs for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
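
To illustrate what "sparsely activated" means in practice, the following is a minimal sketch of a mixture-of-experts layer in plain Python. It is not the GLaM implementation: the abstract does not describe the routing scheme, so the choice of top-2 routing, the softmax gate, the toy scaling "experts", and every name here (`route_top_k`, `moe_layer`, `make_expert`) are assumptions made for illustration. The point it shows is that only k of the available experts run for a given input, so parameter count can grow with the number of experts while per-input computation stays roughly fixed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights.

    Top-2 routing is an assumption of this sketch, not stated in the abstract.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, gate_weights, experts, k=2):
    """Run x through only the k selected experts and mix their outputs."""
    # One gate logit per expert: dot product of that expert's gate row with x.
    gate_logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    routes = route_top_k(gate_logits, k)
    out = [0.0] * len(x)
    for idx, weight in routes:
        y = experts[idx](x)  # experts that were not selected are never called
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, [idx for idx, _ in routes]


def make_expert(scale):
    """Hypothetical toy expert: multiplies its input by a constant."""
    return lambda x: [scale * xi for xi in x]


# Four toy experts, of which only two are evaluated for this input.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
x = [2.0, 1.0]
out, used = moe_layer(x, gate_weights, experts, k=2)
print("experts used:", used)
print("output:", out)
```

In this sketch the layer holds four experts' worth of parameters, but each call evaluates only two of them; a dense layer of comparable total size would evaluate all of its parameters for every input.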

 
