Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Abstract

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer-based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).
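
The throughput figures quoted above can be cross-checked with a little arithmetic. The sketch below assumes that scaling efficiency means the sustained aggregate throughput divided by the ideal linear-scaling throughput (number of GPUs times the single-GPU baseline). That definition and the function name `scaling_efficiency` are illustrative assumptions rather than anything taken from the paper's code.

```python
# Figures quoted in the abstract.
NUM_GPUS = 512
SINGLE_GPU_TFLOPS = 39.0   # sustained single-GPU baseline, in TeraFLOPs
SUSTAINED_PFLOPS = 15.1    # sustained across all 512 GPUs, in PetaFLOPs


def scaling_efficiency(sustained_pflops, num_gpus, single_gpu_tflops):
    """Fraction of ideal linear scaling achieved (assumed definition)."""
    # Ideal throughput if every GPU matched the single-GPU baseline.
    ideal_pflops = num_gpus * single_gpu_tflops / 1000.0  # TFLOPs -> PFLOPs
    return sustained_pflops / ideal_pflops


ideal_pflops = NUM_GPUS * SINGLE_GPU_TFLOPS / 1000.0
efficiency = scaling_efficiency(SUSTAINED_PFLOPS, NUM_GPUS, SINGLE_GPU_TFLOPS)

print(f"ideal linear scaling: {ideal_pflops:.2f} PFLOPs")
print(f"scaling efficiency:   {efficiency:.1%}")
```

Under this assumed definition the ratio comes out at roughly 0.756, which is consistent with the 76% scaling efficiency stated in the abstract.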
