Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Abstract

Recent work in unsupervised language modeling demonstrates that training large neural language models advances the state of the art in Natural Language Processing applications. However, for very large models, memory constraints limit the size of models that can be practically trained. Model parallelism allows us to train larger models, because the parameters can be split across multiple processors. In this work, we implement a simple, efficient intra-layer model parallel approach that enables training state-of-the-art transformer language models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging an 8.3 billion parameter transformer language model using 512 GPUs, making it the largest transformer model ever trained at 24x the size of BERT and 5.6x the size of GPT-2. We sustain up to 15.1 PetaFLOPs per second across the entire application with 76% scaling efficiency, compared to a strong single-processor baseline that sustains 39 TeraFLOPs per second, which is 30% of peak FLOPs. The model is trained on 174GB of text, requiring 12 ZettaFLOPs over 9.2 days to converge. Transferring this language model achieves state-of-the-art (SOTA) results on the WikiText103 (perplexity of 10.8 compared to SOTA perplexity of 16.4) and LAMBADA (accuracy of 66.5% compared to SOTA accuracy of 63.2%) datasets. We release training and evaluation code, as well as the weights of our smaller portable model, for reproducibility.
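
The throughput figures quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the released code. It assumes that scaling efficiency is measured against perfect linear scaling of the single-GPU baseline, that "12 ZettaFLOPs" refers to the total number of floating-point operations, and that the 9.2 days are continuous wall-clock time. All variable names are illustrative.

```python
# Figures taken directly from the abstract.
num_gpus = 512
baseline_tflops_per_gpu = 39.0    # sustained single-GPU baseline, TeraFLOPs/s
sustained_pflops = 15.1           # sustained across the application, PetaFLOPs/s
total_zettaflops = 12.0           # total training compute (assumed: total operations)
training_days = 9.2               # assumed: continuous wall-clock time
baseline_fraction_of_peak = 0.30  # baseline is 30% of peak FLOPs

# Assumption: "ideal" throughput is perfect linear scaling of the baseline.
ideal_pflops = num_gpus * baseline_tflops_per_gpu / 1000.0
scaling_efficiency = sustained_pflops / ideal_pflops

# Average throughput implied by total compute divided by training time.
training_seconds = training_days * 24 * 3600
implied_pflops = total_zettaflops * 1e21 / training_seconds / 1e15

# Per-GPU peak implied by "39 TeraFLOPs/s is 30% of peak".
implied_peak_tflops = baseline_tflops_per_gpu / baseline_fraction_of_peak

print(f"ideal linear scaling: {ideal_pflops:.2f} PFLOPs/s")
print(f"scaling efficiency:   {scaling_efficiency:.1%}")
print(f"implied throughput:   {implied_pflops:.1f} PFLOPs/s")
print(f"implied per-GPU peak: {implied_peak_tflops:.0f} TFLOPs/s")
```

Under these assumptions the numbers are mutually consistent: linear scaling of the baseline gives just under 20 PetaFLOPs per second, of which 15.1 is roughly 76%, and 12 ZettaFLOPs spread over 9.2 days works out to about 15.1 PetaFLOPs per second.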
