Recent work in unsupervised language modeling demonstrates that traininglarge neural language models advances the state of the art in Natural LanguageProcessing applications. However, for very large models, memory constraintslimit the size of models that can be practically trained. Model parallelismallows us to train larger models, because the parameters can be split acrossmultiple processors. In this work, we implement a simple, efficient intra-layermodel parallel approach that enables training state of the art transformerlanguage models with billions of parameters. Our approach does not require anew compiler or library changes, is orthogonal and complimentary to pipelinemodel parallelism, and can be fully implemented with the insertion of a fewcommunication operations in native PyTorch. We illustrate this approach byconverging an 8.3 billion parameter transformer language model using 512 GPUs,making it the largest transformer model ever trained at 24x times the size ofBERT and 5.6x times the size of GPT-2. We sustain up to 15.1 PetaFLOPs persecond across the entire application with 76% scaling efficiency, compared to astrong single processor baseline that sustains 39 TeraFLOPs per second, whichis 30% of peak FLOPs. The model is trained on 174GB of text, requiring 12ZettaFLOPs over 9.2 days to converge. Transferring this language model achievesstate of the art (SOTA) results on the WikiText103 (10.8 compared to SOTAperplexity of 16.4) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%)datasets. We release training and evaluation code, as well as the weights ofour smaller portable model, for reproducibility.
Quick Read (beta)
Recent work in unsupervised language modeling demonstrates that training large neural language models advances the state of the art in Natural Language Processing applications. However, for very large models, memory constraints limit the size of models that can be practically trained. Model parallelism allows us to train larger models, because the parameters can be split across multiple processors. In this work, we implement a simple, efficient intra-layer model parallel approach that enables training state of the art transformer language models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complimentary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging an 8.3 billion parameter transformer language model using 512 GPUs, making it the largest transformer model ever trained at 24x times the size of BERT and 5.6x times the size of GPT-2. We sustain up to 15.1 PetaFLOPs per second across the entire application with 76% scaling efficiency, compared to a strong single processor baseline that sustains 39 TeraFLOPs per second, which is 30% of peak FLOPs. The model is trained on 174GB of text, requiring 12 ZettaFLOPs over 9.2 days to converge. Transferring this language model achieves state of the art (SOTA) results on the WikiText103 (10.8 compared to SOTA perplexity of 16.4) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. We release training and evaluation code, as well as the weights of our smaller portable model, for reproducibility.
oddsidemargin has been altered.
marginparsep has been altered.
topmargin has been altered.
marginparwidth has been altered.
marginparpush has been altered.
paperheight has been altered.
The page layout violates the ICML style. Please do not change the page layout, or include packages like geometry, savetrees, or fullpage, which change it for you. We’re not able to reliably undo arbitrary changes to the style. Please remove the offending package(s), or layout-changing commands and try again.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi 1 2 Mostofa Patwary 1 2 Raul Puri 1 2 Patrick LeGresley 2 Jared Casper 2 Bryan Catanzaro 2
Natural Language Processing (NLP) is advancing quickly, in part due to an increase in available compute and dataset size. The abundance of compute and data enables training increasingly larger language models via unsupervised language model pretraining (Devlin et al., 2018; Radford et al., 2019b). Empirical evidence indicates that larger language models are dramatically more useful for NLP tasks such as article completion, question answering, and natural language inference. By transferring or finetuning these pretrained language models on downstream natural language tasks, one can achieve state of the art results as shown in recent work (Devlin et al., 2018; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; 2019b; 2017; Ramachandran et al., 2016; Liu et al., 2019b; Dai et al., 2019; Yang et al., 2019; Liu et al., 2019a).
As these models become larger, they exceed the memory limit of modern processors, and require additional memory management techniques such as activation checkpointing (Chen et al., 2016). Widely used optimization algorithms such as ADAM require additional memory per parameter to store momentum and other optimizer state, which reduces the size of models that can be effectively trained. Several approaches to model parallelism overcome this limit by partitioning the model such that the weights and their associated optimizer state do not need to reside concurrently on the processor. For example, GPipe (Huang et al., 2018) and Mesh-Tensorflow (Shazeer et al., 2018) provide frameworks for model parallelism of different kinds. However, they require rewriting the model, and rely on custom compilers and frameworks that are still under development and not applicable to all problems.
In this work, we train a transformer-based language model with 8.3 billion parameters efficiently using intra-layer model-parallelism. We exploit the inherent structure in transformer based language models to make a simple model-parallel implementation that trains efficiently in PyTorch, with no custom C++ code or compiler required. This approach is orthogonal to pipeline-based model parallelism as advocated by approaches such as (Huang et al., 2018).
To demonstrate the scalability of our approach, we establish a baseline by training a model of 1.2 billion parameters on a single NVIDIA V100 32GB GPU, that sustains 39 TeraFLOPs per second over the course of the entire training application. This is 30% of the theoretical peak FLOPS for a single GPU as configured in a DGX-2H server, and thus represents a very strong baseline. Scaling the model to 8.3 billion parameters on 512 GPUs with 8-way model parallelism, we achieve up to 15.1 PetaFLOPs per second sustained over the entire application. This is 76% scaling efficiency compared to the single GPU case. Converging the model on 174 GB of text over 9.2 days requires 12 ZettaFLOPs in total. Figure 1 shows more detailed scaling results.
We analyze the accuracy of our trained models by computing perplexity on the WikiText103 dataset and cloze-style prediction accuracy on the LAMBADA dataset. We show that the WikiText103 perplexity decreases and LAMBADA accuracy increases with increasing model size and achieves state of the art results on these tasks.
In summary, our contributions are as follows:
We implement a simple and efficient model parallel approach by making only a few targeted modifications to an existing PyTorch transformer implementation.
We demonstrate convergence of an 8.3 billion parameter Transformer-based Language Model, which is the largest transformer-based neural language model that has been published.
We showcase that our models further advance the state of the art (SOTA) in natural language processing by achieving SOTA perplexity on WikiText103 (10.8 ppl) and SOTA accuracy on the LAMBADA dataset (66.5%).
We perform an in-depth empirical analysis of our model and data parallel technique and demonstrate up to 76% scaling efficiency using 512 GPUs.
We open source our code along with the training and evaluation pipelines at https://github.com/nvidia/megatron-lm
Pretrained language models have become an indispensable part of NLP researchers’ toolkits. Leveraging large corpus pretraining to learn robust neural representations of language is an active area of research that has spanned the past decade. Throughout this period, the community has seen an increasing trend in the scale and complexity of these pretraining methods that have steadily advanced the state of the art. Early examples of pretraining and transferring neural representations of language demonstrated that pretrained word embedding tables improve downstream task results compared to word embedding tables learned from scratch (Mikolov et al., 2013; Pennington et al., 2014; Turian et al., 2010). Later work advanced research in this area by learning and transferring neural models that capture contextual representations of words (Melamud et al., 2016; McCann et al., 2017; Peters et al., 2018; Radford et al., 2017; 2019b). Recent parallel work (Ramachandran et al., 2016; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018; Liu et al., 2019b; Dai et al., 2019; Yang et al., 2019; Liu et al., 2019a) further builds upon these ideas by not just transferring the language model to extract contextual word representations, but by also finetuning the language model in an end to end fashion on downstream tasks.
Through these works, the state of the art has advanced from transferring just word embedding tables to transferring entire 1.5B parameter language models. This progression of methods has necessitated the need for hardware, systems techniques, and frameworks that are able to operate efficiently at scale and satisfy increasing computational needs. Our work aims to provide the tools necessary to take another step forward in this trend.
Language modeling is a central task in natural language processing and language understanding. It is widely used in many applications such as speech recognition, question and answering, and summarization. To model the sequential nature of language, recurrent neural networks (RNNs/LSTMs (Hochreiter & Schmidhuber, 1997)) have been used for more than a decade. However, due to their inability to model long range dependencies and the sequential nature of these models (processes tokens one by one), efficient training on large corpora has been challenging. Recently, new approaches based on attention modules, called transformers (Vaswani et al., 2017), have been introduced. These models demonstrate superior accuracy and compute efficiency. Instead of considering tokens one by one and maintaining a hidden state, transformers consider entire segments of tokens and learn how to interpret the dependencies within them.
The original transformer formulation was designed as a machine translation architecture that transforms an input sequence into another output sequence using two parts, an Encoder and Decoder. However, recent works leveraging transformers for language modeling such as GPT (Radford et al., 2018), BERT (Devlin et al., 2018) and GPT-2 (Radford et al., 2019b) use only the Encoder or Decoder depending on their needs. GPT, GPT-2, and other autoregressive transformer language models (Dai et al., 2019; Yang et al., 2019) employ multi-layer transformer decoder architectures. Our work focuses on architectures similar to GPT-2. A high level overview of the GPT-2 transformer model is presented below.
Figure 2 shows a schematic diagram of the model we used. The model consists of an input subword token embedding layer, positional embedding layer, stack of identical transformer layers, and linear output embedding layer followed by a final softmax layer. The embedding layer embeds the input subword tokens into vectors and positional encoding helps the transformer capture the order of the input tokens. The positional encoding is added to the output of the subword embedding layer, and fed to the first transformer layer. The normalized output of the last transformer layer is given to a linear layer with weights equivalent to the transpose of the input embedding. To generate probability distributions for output subword tokens a final softmax layer is applied.
As mentioned above, each transformer layer is a decoder only transformer block, which consist of a multi-head attention layer with a left-to-right attention mask, and a feed forward layer. Each multi-head attention layer consists of several attention heads that run in parallel. Each attention head uses self attention to adaptively process each token input conditioned on the other input tokens. The left-to-right attention mask ensures that a given input only attends to the positions that precede it to the left. Each head uses a fully connected layer to map each token to a key, query, and value vector of dimension hidden-size/number-of-attention-heads. Each attention head then maps a query over all key-value pairs to its output. For a single query in the sequence, this is implemented as a scaled dot product of the query with all other keys followed by a softmax that is then used to obtain a weighted sum over all values. To improve robustness of the model and aid in training convergence, attention dropout is applied to the softmax values. For a single token, the outputs of the multiple attention heads are concatenated into a vector that is the size of the hidden dimension. More details on the transformer layer can be found in (Vaswani et al., 2017).
It is worthwhile to mention that GPT-2 uses GeLU (Hendrycks & Gimpel, 2016) nonlinearities and layer normalization (Ba et al., 2016) to the input of the multi-head attention and feed forward layers, whereas the original transformer (Vaswani et al., 2017) uses ReLU nonlinearities and applies layer normalization to outputs. Therefore, in the GPT-2 model, an additional layer normalization is added after the final transformer layer.
There are two central paradigms for scaling out deep neural network training to numerous hardware accelerators: data parallelism (Valiant, 1990) where a training minibatch is split across multiple workers, and model parallelism in which the memory usage and computation of a model is distributed across multiple workers. Data parallelism has become an indispensable tool for large scale deep neural network training due to its desirable weak scaling properties. By increasing the minibatch size proportionally to the number of available workers, one observes near linear scaling in training data throughput. However, large batch training introduces complications into the optimization process that can result in reduced accuracy or longer time to convergence, offsetting the benefit of increased training throughput (Keskar et al., 2017). Further research (Goyal et al., 2017; You et al., 2017; 2019) has developed techniques to mitigate these effects and drive down training time of large neural networks from weeks and months to the order of days, hours, minutes and in some cases even seconds. To scale out training even further, parallel work (Chen et al., 2016) has combined data parallelism with activation checkpointing: recomputing activations in the backward pass without storing them in the forward pass to reduce memory requirements.
However, these techniques have one fundamental limitation in the problem size they can tackle: the model must fit entirely on one worker. With language models of increasing size and complexity like BERT and GPT-2, neural networks have approached the memory capacity of modern hardware accelerators. To continue advancing the field and train larger language models that scale well with computational resources, we must utilize model parallelism in addition to data parallelism. Within model parallelism, there are two further paradigms: layer-wise pipeline parallelism, and more general distributed tensor computation. In pipeline model parallelism, groups of operations are performed on one device before the outputs are passed to the next device in the pipeline where a different group of operations are performed. To ensure that the devices are not idle, waiting for input from other devices, pipeline parallel approaches partition (and usually increase) the batch size so that a portion of the minibatch is always being computed on a device, and all devices are being maximally utilized at all times. This approach to model parallelism mirrors the instruction pipelining found in CPUs. Some approaches (Harlap et al., 2018; Chen et al., 2018) use a parameter server (Li et al., 2014) in conjunction with pipeline parallelism. However these suffer from inconsistency issues. The GPipe framework for TensorFlow (Huang et al., 2018) overcomes this inconsistency issue by using synchronous gradient decent. This approach requires additional logic to handle the efficient pipelining of these communication and computation operations, and suffers from pipeline bubbles that reduce efficiency, or changes to the optimizer itself which impact accuracy.
Distributed tensor computation is a more general approach that partitions a tensor operation across multiple devices to accelerate computation or increase model size. FlexFlow (Jia et al., 2018), a deep learning framework orchestrating such parallel computation, provides a method to pick the best parallelization strategy. Recently, Mesh-TensorFlow (Shazeer et al., 2018) introduced a language for specifying a general class of distributed tensor computations in TensorFlow (Abadi et al., 2015). The parallel dimensions are specified in the language by the end user and the resulting graph is compiled with proper collective primitives. We utilize similar insights to those leveraged in Mesh-TensorFlow and exploit parallelism in computing the transformer’s attention heads to parallelize our transformer model. However, rather than implement a framework and compiler for model parallelism, we make only a few targeted modifications to existing PyTorch transformer implementations. Our approach is simple, does not require any new compiler or code re-wiring, and can be fully implemented by inserting a few simple primitives, as described in the next section.
We take advantage of the structure of transformer networks to create a simple model parallel implementation by adding a few synchronization primitives. A transformer layer consists of a self attention block followed by a two-layer, multi-layer perceptron (MLP) as shown in Figure 2. We introduce model parallelism in both of these blocks separately.
We start by detailing the MLP block. The first part of the block is a GEMM followed by a GeLU nonlinearity:
One option to parallelize the GEMM is to split the weight matrix along its rows and input along its columns as:
This partitioning will result in . Since GeLU is a nonlinear function, and this approach will require a synchronization point before the GeLU function.
Another option is to split along its columns . This partitioning allows the GeLU nonlinearity to be independently applied to the output of each partitioned GEMM:
This is advantageous as it removes a synchronization point. Hence, we partition the first GEMM in this column parallel fashion and split the second GEMM along its rows so it takes the output of the GeLU layer directly without requiring any communication as shown in Figure 3a. The output of the second GEMM is then reduced across the GPUs before passing the output to the dropout layer. This approach splits both GEMMs in the MLP block across GPUs and requires only a single all-reduce operation in the forward pass ( operator) and a single all-reduce in the backward pass ( operator). These two operators are conjugates of each other and can be implemented in PyTorch with only a few lines of code. As an example, the implementation of the operator is provided below:
As shown in Figure 3b, for the self attention block we exploit inherent parallelism in the multihead attention operation, partitioning the GEMMs associated with key (), query (), and value () in a column parallel fashion such that the matrix multiply corresponding to each attention head is done locally on one GPU. This allows us to split per attention head parameters and workload across the GPUs, and doesn’t require any immediate communication to complete the self-attention. The subsequent GEMM from the output linear layer (after self attention) is parallelized along its rows and takes the output of the parallel attention layer directly, without requiring communication between the GPUs. This approach for both the MLP and self attention layer fuses groups of two GEMMs, eliminates a synchronization point in between, and results in better scaling. This enables us to perform all GEMMs in a simple transformer layer using only two all-reduces in the forward path and two in the backward path (see Figure 4).
The transformer language model has an output embedding with the dimension of hidden-size () times vocabulary-size (). Since the vocabulary size is on the order of tens of thousands of tokens for modern language models (for example, GPT-2 used a vocabulary size of 50,257), it is beneficial to parallelize the output embedding GEMM. However, in transformer language models, the output embedding layer shares weights with the input embedding, requiring modifications to both. We parallelize the input embedding weight matrix along the vocabulary dimension (column-wise). Since each partition now only contains a portion of the embedding table, an all-reduce ( operator) is required after the input embedding. For the output embedding, one approach is to perform the parallel GEMM to obtain the logits, add an all-gather , and send the results to the cross-entropy loss function. However, for this case, the all-gather will communicate elements ( is the batch-size and is the sequence length) which is huge due to vocabulary size being large. To reduce the communication size, we fuse the output of the parallel GEMM with the cross entropy loss which reduces the dimension to . Communicating scalar losses instead of logits is a huge reduction in communication that improves the efficiency of our model parallel approach.
Much of our model parallel approach can be characterized as techniques aimed at reducing communication and keeping the GPUs compute bound. Rather than having one GPU compute part of the dropout, layer normalization, or residual connections and broadcast the results to other GPUs, we choose to duplicate the computation across GPUs. Specifically, we maintain duplicate copies of layer normalization parameters on each GPU, and take the output of the model parallel region and run dropout and residual connection on these tensors before feeding them as input to the next model parallel regions. To optimize the model we allow each model parallel worker to optimize its own set of parameters. Since all values are either local to or duplicated on a GPU, there is no need for communicating updated parameter values in this formulation.
In summary, our approach as described above is simple to implement, requiring only a few extra all-reduce operations added to the forward and backward pass. It does not require a compiler, and is orthogonal and complementary to the pipeline model parallelism advocated by approaches such as (Huang et al., 2018). In the remainder of this section, we describe some implementation details of our method.
Techniques that utilize random number generation, such as dropout (Srivastava et al., 2014), are a staple of modern deep learning training. Transformers have dropout layers outside the model parallel regions before residual connections and within model parallel regions in the self attention block. Because some dropout layers are in a model parallel region, while others are not, we need to treat random number generation carefully to ensure dropout works correctly. To synchronize residual connection dropout across model parallel workers we seed the random number generators at the beginning of training with the same seed. This results in identical dropout patterns across all model parallel workers. However, dropout within a model parallel region should result in different random patterns for each worker to achieve randomness across the entire operation. To achieve this we maintain a separate random number generator for dropout within model parallel regions. This random number generator is uniquely seeded for each model parallel worker.
Model parallelism is orthogonal to data parallelism, and so we can use both simultaneously to train large models in a reasonable amount of time. Figure 5 shows a grouping of GPUs for hybrid model and data parallelism. Two or more GPUs within the same server form model parallel groups (for example GPUs 1 to 8 in Figure 5), and contain one instance of the model distributed across these GPUs. The remaining GPUs, which could be within the same server but more typically are located in other servers, run additional model parallel groups. GPUs with the same position in each of the model parallel groups (for example GPUs 1, 9, …, 505 in Figure 5) form data parallel groups so that all GPUs within a data parallel group hold the same model parameters. During back propagation we run multiple gradient all-reduce operations in parallel to reduce weight gradients within each distinct data parallel group. The total number of required GPUs is the product of the number of model and data parallel groups. For example, for the 8.3 billion parameter model we use 8 GPUs per model parallel group and 64-way data parallelism, for a total of 512 GPUs. All communication is implemented in PyTorch by Python calls to NCCL. GPUs within each model parallel group perform all-reduces amongst all GPUs within the group. For data parallelism, each of the all-reduce operations takes place with one of the GPUs from each model parallel group.
Language modeling is a central task in natural language processing and language understanding. There are several formulations of language modeling, but we focus on the most commonly used formulation: for the sequence , predict the next token , given the previous tokens. Using cross-entropy loss, we minimize
where is the context used and is the conditional probability of the next token.
To train our model we follow a procedure largely based on the training procedures described in (Radford et al., 2018; 2019b; 2019a) with a few additions. All training is performed with sequences of 1024 subword units at a batch size of 512 for 300k iterations. To train our models efficiently we utilize mixed precision training with dynamic loss scaling to take advantage of the V100’s Tensor Cores (Micikevicius et al., 2017; NVIDIA, 2018). We start by initializing our weights with a simple normal distribution . We then scale weights immediately before residual layers by where N is the number of transformer layers comprised of self attention and MLP blocks. For our optimizer we utilize an Adam optimizer (Kingma & Ba, 2014) with weight decay (Loshchilov & Hutter, 2019) . Additionally, we use global gradient norm clipping of 1.0 to improve the stability of training large models. Our learning rate of 1.5e-4 utilizes a warmup period of 3000 iterations before following a single cycle cosine decay over the remaining 297k iterations. We stop the decay at a minimum learning rate of 1e-5. In our experiments we found that tuning the learning rate of a particular model via cross-validation improved accuracies, but for the sake of simplicity we consider one learning rate across all model sizes. In all cases, a dropout of 0.1 is used. Lastly, to better manage our memory footprint we utilize activation checkpointing (Chen et al., 2016) after every transformer layer.
To collect a large diverse training set with longterm dependencies we aggregate several of the largest language modeling datasets. We create an aggregate dataset consisting of Wikipedia (Devlin et al., 2018), CC-Stories (Trinh & Le, 2018), RealNews (Zellers et al., 2019), and OpenWebtext. To avoid training set leakage into our downstream tasks we remove the Wikipedia articles present in the WikiText103 test set (Merity et al., 2016). We also remove unnecessary newlines from the CC-Stories corpus introduced by preprocessing artifacts.
For OpenWebText, we created a dataset downloaded from Reddit, a social media platform, for training. The dataset is conceptually similar to the webtext dataset used in (Radford et al., 2019b). We scraped this dataset with the publicly available OpenWebText codebase11 1 https://github.com/eukaryote31/openwebtext. We first scraped all the outgoing URLs with at least 3 karma score from Reddit. We then filtered out URLs with blacklisted domains (e.g. adult content or image and video hosting sites), blacklisted types (e.g. jpg, exe, ppt, etc), and duplicate URLs. We used the newspaper library to download the text from each URL then applied langdetect22 2 https://pypi.org/project/langdetect/ to filter out non-english content and ftfy33 3 https://ftfy.readthedocs.io/en/latest/ to normalize unicode text.
We combined all the datasets and then filtered out all the documents with content length less than 128 tokens from the aggregated dataset. Since similar content might appear multiple times in the aggregated datasets, we used LSH to deduplicate content with a jaccard similarity greater than 0.7. The resulting aggregate corpus contains 174 GB of deduplicated text.
To ensure we do not train on any data found in our test sets, we calculate the percentage of test set 8-grams that also appear in our training set as done in previous work (Radford et al., 2019b). To calculate the overlap we also use a Bloom filter but with a more conservative false positive rate of to save computation cost. The WikiText103 test set has at most overlap and the LAMBADA test set (Paperno et al., 2016) has at most overlap. We should note that the WikiText103 test set has already overlap with the WikiText103 training set (Radford et al., 2019b). As these are consistent with previous work, we are confident that no documents from our test data are inadvertently included in our training data.
For training, we apply byte-pair encoding tokenization (Bojanowski et al., 2017; Radford et al., 2019b) and randomly split this dataset into a 29:1 ratio to obtain training (168.2 GB) and validation (5.8 GB) sets, respectively. We divide the training set into 5 equal shards and shuffle the processing order of these shards. Within each shard, we shuffle the order of the documents and add an end of text token at the end of each document. We then flatten the entire shard and chunk it into 1024 token portions. We then shuffle these chunks randomly one last time before presenting them to the model. We repeat this randomization process every epoch.
To analyze the effect increasing model size has on a model’s ability to understand language, we need suitable evaluation criteria. Two commonly utilized criteria are language model perplexity on the WikiText103 dataset (Merity et al., 2016) and cloze-style prediction accuracy on the LAMBADA dataset(Paperno et al., 2016).
WikiText103 perplexity is an evaluation criterion that has been well studied over the past few years since the creation of the benchmark dataset. Perplexity is the exponentiation of the average cross entropy of a corpus (Mikolov et al., 2011). This makes it a natural evaluation metric for language models which represent a probability distribution over entire sentences or texts.
To calculate perplexity in (4) we tokenize the WikiText103 test corpus according to our subword vocabulary and sum the cross entropy loss from each token . We then normalize the cross entropy loss by the number of tokens in the original tokenization scheme . The WikiText103 test corpus already comes pre-tokenized with word level tokens that prior works have used to compute perplexity. To evaluate our models’ perplexities on a level playing field with prior works we must normalize by the original number of tokens, , rather than the number of tokens, , actually in the tokenized data fed as input to our model. This pre-tokenization also introduces artifacts in the text that are not present in our training data. To alleviate this distributional mismatch, we first preprocess the WikiText103 test dataset with invertible detokenizers to remove various artifacts related to punctuation and whitespace. The value of is calculated before this preprocessing. For WikiText103’s test set and .
We must also make one further transformer-specific modification to the perplexity calculation. Unlike RNN-based language models, transformers operate on a fixed window input size. Therefore they cannot fully calculate and can only calculate where is the size of our context: 1024 tokens. However, calculating this value for every token in our dataset is prohibitively expensive since we must compute approximately evaluations of a sized context. To evaluate our models efficiently we take a middle ground approach termed overlapping evaluation where we advance the sliding window by some overlap each time and only compute the cross entropy losses corresponding to the last tokens of the window. In our experiments we utilize an overlap of 32, and compute losses over all sliding windows in such a fashion.
The capability to handle long term contexts is crucial for state of the art language models and is a necessary prerequisite for problems like long-form generation and document-based question answering. Cloze-style datasets like LAMBADA are designed to measure a model’s ability to operate in and reason about these types of long term contexts. Cloze-style reading comprehension uses a context of word tokens with one token masked; the models objective is to correctly predict the value of the missing token. To accurately predict the missing token, the model requires an in-depth understanding of the surrounding context and how language should be used in such a context. LAMBADA uses cloze-style reading comprehension to test generative left-to-right language models by constructing examples of 4-5 sentences where the last word in the context is masked. Our models utilize subword units, so for LAMBADA evaluation we utilize the raw, unprocessed LAMBADA dataset and require that our model predict the multiple subword tokens that make up the word token. We use teacher forcing, and consider an answer correct only when all output predictions are correct. This formulation is equivalent to the original task of word token prediction.
All of our experiments are conducted on NVIDIA’s DGX SuperPod44 4 See https://devblogs.nvidia.com/dgx-superpod-world-record-supercomputing-enterprise/ and we use up to 32 DGX-2H servers (a total of 512 Tesla V100 SXM3 32GB GPUs). This system is optimized for multi-node deep learning applications, with 300 GB/sec bandwidth between GPUs inside a server via NVSwitch and 100 GB/sec of interconnect bandwidth between servers using 8 InfiniBand adapters per server.
To test the scalability of our implementation, we consider GPT-2 models with four sets of parameters detailed in Table 1. To have consistent GEMM sizes in the self attention layer, the hidden size per attention head is kept constant at 96 while the number of heads and layers are varied to obtain configurations ranging from 1 billion to 8 billion parameters. The configuration with 1.2 billion parameters fits on a single GPU whereas the 8 billion parameter model requires 8-way model parallelism (8 GPUs). The original vocabulary size was 50,257, however, to have efficient GEMMs for the logit layer, it is beneficial for the per-GPU vocabulary size to be a multiple of 128. Since we study up to 8-way model parallelism, we pad the vocabulary such that it is divisible by , resulting in a padded vocabulary size of 51,200. We study both model and model+data parallel scaling. For the model parallel scaling, a fixed batch size of 8 is used across all configurations. Data parallel scaling is necessary for training many state of the art models which typically use a much larger global batch size. To this end, for the model+data parallel cases we fix the global batch size to 512 for all experiments which corresponds to 64-way data parallelism.
Throughout this section, we will showcase weak scaling with respect to the model parameters for both model parallel and model+data parallel cases. Weak scaling is typically done by scaling the batch-size, however, this approach does not address training large models that do not fit on a single GPU and it leads to training convergence degradation for large batch sizes. In contrast, here we use weak scaling to train larger models that were not possible otherwise. The baseline for all the scaling numbers is the first configuration (1.2 billion parameters) in Table 1 running on a single GPU. This is a strong baseline as it achieves 39 TeraFLOPS during the overall training process, which is 30% of the theoretical peak FLOPS for a single GPU in a DGX-2H server.
Figure 6 shows scaling values for both model and model+data parallelism. We observe excellent scaling numbers in both settings. For example, the 8.3 billion parameters case with 8-way (8 GPU) model parallelism achieves 77% of linear scaling. Model+data parallelism requires further communication of gradients and as a result the scaling numbers drop slightly. However, even for the largest configuration (8.3 billion parameters) running on 512 GPUs, we achieve 74% scaling relative to linear scaling of the strong single GPU baseline configuration (1.2 billion parameters).
This section studies the effect of attention heads on model parallel scaling. To this end, we consider the 8.3 billion parameter configuration with 8-way model parallelism and vary the number of heads from 16 to 32. The results are presented in Table 2. As the number of attention heads increases, some of the GEMMS inside the self-attention layer become smaller and also the number of elements in the self attention softmax increases. This results in a slight decrease in scaling efficiency. Future research should be wary of this hyperparameter to design large transformer models that balance model speed and model accuracy.
|Attention heads||Hidden size per head||Scaling Efficiency|
Our model parallelism is primarily designed to enable training models larger than what can fit in the memory of a single GPU, but it can also accelerate the training of smaller models without increasing the batch size. To measure this acceleration we train a model with a fixed 1.2 billion parameters. We use a fixed batch size of 8 samples per iteration and increase the number of GPUs using model parallelism. The results are listed in Table 3. Using two GPUs makes training faster. Above that we see diminishing returns as the per-GPU computation decreases and the memory bandwidth and communication overheads begin to dominate.
|# of GPUs||1||2||4||8|
To demonstrate that large language models can further advance the state of the art, we consider training models of the sizes and configurations listed in Table 4. The 355M model is equivalent in size and configuration of BERT-Large model (Devlin et al., 2018). The 2.5B model is bigger than the previous largest GPT-2 model, and the 8.3B model is larger than any transformer model ever trained, to the best of our knowledge. To train and evaluate our language models we use the procedure described in section id1. Table 4 also lists the time it takes to advance one epoch which is equivalent to 68,507 iterations. For example, for the 8.3B model on 512 GPUs, each epoch takes around two days. Compared to the configurations used for our scaling studies in Table 1, the 2.5B model is the same, the 8.3B model has 24 attention heads instead of 32, and the 355M is much smaller than any seen previously while still using 64 GPUs to train, leading to the much lower time per epoch.
Figure 7 shows validation perpelixity as a function of number of iterations. As the model size increases, the validation perpelixity decreases and reaches a validation perplexity of 9.27 for the 8.3B model. We report the zero-shot evalution of the trained models on the LAMBADA and WikiText103 datasets in Table 5. We observe the trend that increasing model size also leads to lower perplexity on WikiText103 and higher cloze accuracy on LAMBADA. Our 8.3B model achieves state of the art perplexity on the WikiText103 test set at a properly adjusted perplexity of 10.81. At 66.51% accuracy, the 8.3B model similarly surpasses prior cloze accuracy results on the LAMBADA task. Several samples generated from the model are provided in the appendix.
There are several directions for future work. Continuing to increase the scale of pretraining is a promising line of investigation that will further test existing deep learning hardware and software. To realize this, improvements in the efficiency and memory footprint of optimizers will be needed. Scaling batch size is another approach to improve efficiency of training. However, larger batch sizes require more memory for activations and approaches such as gradient accumulation will be needed. In addition, training a model with more than 16 billion parameters will demand more memory than is available within 16 GPUs of a DGX-2H box. For such models, a hybrid intra-layer and inter-layer model parallelism along with inter-node model parallelism would be more suitable.
Increasing the scale of GPT-2 pretraining and transfer is not the only way to demonstrate the effectiveness of large scale language modeling. To this end three directions of investigation include (a) pretraining different model families (e.g. BERT, Transformer-XL, and XLNet), (b) evaluating performance of large models across more difficult and diverse downstream tasks (e.g. Question Answering, Summarization, and Conversation), and (c) using knowledge distillation to train small student models from our large pretrained teacher models.
In this work, we trained the world’s largest transformer based language model using existing deep learning hardware, software, and models. In doing so, we successfully surpassed the limitations posed by traditional single-GPU-per-model training by implementing a simple and efficient model parallel approach with only a few targeted modifications to the existing PyTorch transformer implementations. We efficiently trained an 8.3 billion parameter language model (24x and 5.6x larger than the size of BERT and GPT-2, respectively) on 512 NVIDIA V100 GPUs with 8-way model parallelism and achieved up to 15.1 PetaFLOPs per second sustained over the entire application. With weak scaling, we found that increasingly large transformer models can be trained in a similar amount of time compared to their smaller counterparts and can demonstrably improve application accuracies. Our larger language models demonstrate this by achieving far superior results on downstream tasks and establish new SOTA for WikiText103 and LAMBADA datasets. Finally, we open sourced our code to enable future work leveraging model parallel transformers and further advance the state of the art on various downstream NLP applications.
Acknowledgements We would like to thank Dr. Julie Bernauer and Dr. Mike Houston for help with the training infrastructure.
- Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
- Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. Layernorm. CoRR, abs/1607.06450, 2016. URL http://arxiv.org/abs/1607.06450.
- Bojanowski et al. (2017) Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
- Chen et al. (2018) Chen, C.-C., Yang, C.-L., and Cheng, H.-Y. Efficient and robust parallel dnn training through model parallelism on multi-gpu platform. arXiv:1809.02839, 2018.
- Chen et al. (2016) Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost. CoRR, abs/1604.06174, 2016. URL http://arxiv.org/abs/1604.06174.
- Dai et al. (2019) Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q. V., and Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860, 2019. URL http://arxiv.org/abs/1901.02860.
- Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding, 2018.
- Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.
- Harlap et al. (2018) Harlap, A., Narayanan, D., Phanishayee, A., Seshadri, V., Devanur, N., Ganger, G., and Gibbons, P. Pipedream: Fast and efficient pipeline parallel dnn training. arXiv:1806.03377, 2018.
- Hendrycks & Gimpel (2016) Hendrycks, D. and Gimpel, K. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016. URL http://arxiv.org/abs/1606.08415.
- Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Howard & Ruder (2018) Howard, J. and Ruder, S. Fine-tuned language models for text classification. CoRR, abs/1801.06146, 2018.
- Huang et al. (2018) Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. Gpipe: Efficient training of giant neural networks using pipeline parallelism. CoRR, abs/1811.06965, 2018. URL http://arxiv.org/abs/1811.06965.
- Jia et al. (2018) Jia, Z., Zaharia, M., and Aiken, A. Beyond data and model parallelism for deep neural networks. arXiv:1807.05358, 2018.
- Keskar et al. (2017) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large- batch training for deep learning: Generalization gap and sharp minima. ICLR, 2017.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krause et al. (2019) Krause, B., Kahembwe, E., Murray, I., and Renals, S. Dynamic evaluation of transformer language models. arXiv:1904.08378, 2019.
- Li et al. (2014) Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. pp. 583–598, 2014.
- Liu et al. (2019a) Liu, X., He, P., Chen, W., and Gao, J. Multi-task deep neural networks for natural language understanding. CoRR, abs/1901.11504, 2019a. URL http://arxiv.org/abs/1901.11504.
- Liu et al. (2019b) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019b. URL http://arxiv.org/abs/1907.11692.
- Loshchilov & Hutter (2019) Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
- McCann et al. (2017) McCann, B., Bradbury, J., Xiong, C., and Socher, R. Learned in translation: Contextualized word vectors. CoRR, abs/1708.00107, 2017.
- Melamud et al. (2016) Melamud, O., Goldberger, J., and Dagan, I. context2vec: Learning generic context embedding with bidirectional lstm. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 51–61, 01 2016.
- Merity et al. (2016) Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. CoRR, abs/1609.07843, 2016. URL http://arxiv.org/abs/1609.07843.
- Micikevicius et al. (2017) Micikevicius, P., Narang, S., Alben, J., Diamos, G. F., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., and Wu, H. Mixed precision training. CoRR, abs/1710.03740, 2017.
- Mikolov et al. (2011) Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Černockỳ, J. Empirical evaluation and combination of advanced language modeling techniques. In Twelfth Annual Conference of the International Speech Communication Association, 2011.
- Mikolov et al. (2013) Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013.
- NVIDIA (2018) NVIDIA. Mixed precision training: Choosing a scaling factor, 2018. URL https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#scalefactor.
- Paperno et al. (2016) Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. CoRR, abs/1606.06031, 2016. URL http://arxiv.org/abs/1606.06031.
- Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. 2014. URL https://www.aclweb.org/anthology/D14-1162.
- Peters et al. (2018) Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. CoRR, abs/1802.05365, 2018. URL http://arxiv.org/abs/1802.05365.
- Radford et al. (2017) Radford, A., Józefowicz, R., and Sutskever, I. Learning to generate reviews and discovering sentiment. CoRR, abs/1704.01444, 2017.
- Radford et al. (2018) Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. 2018. URL https://blog.openai.com/language-unsupervised/.
- Radford et al. (2019a) Radford, A., Child, R., Gray, S., and Sutskever, I. Generating long sequences with sparse transformers. 2019a. URL https://openai.com/blog/sparse-transformer/.
- Radford et al. (2019b) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Better language models and their implications. 2019b. URL https://openai.com/blog/better-language-models/.
- Ramachandran et al. (2016) Ramachandran, P., Liu, P. J., and Le, Q. V. Unsupervised pretraining for sequence to sequence learning. CoRR, abs/1611.02683, 2016. URL http://arxiv.org/abs/1611.02683.
- Shazeer et al. (2018) Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., Hawkins, P., Lee, H., Hong, M., Young, C., Sepassi, R., and Hechtman, B. Mesh-TensorFlow: Deep learning for supercomputers. In Neural Information Processing Systems, 2018.
- Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.
- Trinh & Le (2018) Trinh, T. H. and Le, Q. V. A simple method for commonsense reasoning. CoRR, abs/1806.02847, 2018. URL http://arxiv.org/abs/1806.02847.
- Turian et al. (2010) Turian, J., Ratinov, L., and Bengio, Y. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pp. 384–394, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
- Valiant (1990) Valiant, L. G. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, 1990.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. CoRR, abs/1706.03762, 2017.
- Yang et al. (2019) Yang, Z., Dai, Z., Yang, Y., Carbonell, J. G., Salakhutdinov, R., and Le, Q. V. Xlnet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237, 2019. URL http://arxiv.org/abs/1906.08237.
- You et al. (2017) You, Y., Gitman, I., and Ginsburg, B. Large batch training of convolutional networks. arXiv:1708.03888, 2017.
- You et al. (2019) You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., and Hsieh, C.-J. Large batch optimization for deep learning: Training bert in 76 minutes. arXiv:1904.00962, 2019.
- Zellers et al. (2019) Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., and Choi, Y. Defending against neural fake news. CoRR, abs/1905.12616, 2019. URL http://arxiv.org/abs/1905.12616.
Below are some text samples generated by Megatron-LM using a context prompt. Some of the texts are cut short.
Context: Rescue workers on Tuesday found a flight recorder from the Russian plane that crashed into the Black Sea over the weekend, the defense ministry said. All 92 people aboard the Russian military’s Tu-154 plane are believed to have died Sunday morning when it crashed two minutes after taking off from the southern Russian city of Sochi. Megatron-LM: A child holds up a portrait of a dead relative at a makeshift morgue in a building belonging to a local government in the town of Petropavlovsk-Kamchatsky. A Russian Defense Ministry Tu-154 plane with 92 people on board crashed into the Black Sea on Sunday morning shortly after taking off from the southern Russian city of Sochi. (Yuri Kochetkov/EPA) According to the ministry, the plane was found at a depth of more than 1,000 feet and the recorder was damaged. ”The defense ministry has decided to hand over the recorder to the Interstate Aviation Committee, which is the main investigative body in the air crash,” the statement said. The search for the plane’s black boxes is being hampered by strong winds and rough seas. Investigators have launched a criminal probe into the crash, which officials said was likely caused by a technical fault. Russian President Vladimir Putin declared Wednesday a day of mourning for the victims.