Reducing Activation Recomputation in Large Transformer Models

Abstract

Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capacity constraints. Rather than storing activations for backpropagation, they are traditionally recomputed, which saves memory but adds redundant compute. In this work, we show most of this redundant compute is unnecessary because we can reduce memory consumption sufficiently without it. We present two novel yet very simple techniques: sequence parallelism and selective activation recomputation. In conjunction with tensor parallelism, these techniques almost eliminate the need to recompute activations. We evaluate our approach on language models up to one trillion parameters in scale and show that our method reduces activation memory by 5x, while reducing execution time overhead from activation recomputation by over 90%. For example, when training a 530B parameter GPT-3 style model on 2240 NVIDIA A100 GPUs, we achieve a Model Flops Utilization of 54.2%, which is 29% faster than the 42.1% we achieve using recomputation. Our implementation will be available in both Megatron-LM and NeMo-Megatron.
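The "29% faster" figure can be checked from the two utilization numbers quoted above. The sketch below assumes that, for the same model on the same hardware and GPU count, training throughput is proportional to Model Flops Utilization; the helper name `relative_speedup` is hypothetical and chosen only for illustration.

```python
def relative_speedup(mfu_new: float, mfu_baseline: float) -> float:
    """Fractional throughput gain implied by two MFU values.

    Assumption: same model, hardware, and GPU count, so throughput
    scales linearly with utilization.
    """
    return mfu_new / mfu_baseline - 1.0


# MFU values quoted in the abstract for the 530B parameter model.
speedup = relative_speedup(54.2, 42.1)
print(f"Implied speedup: {speedup:.1%}")
```

Under that proportionality assumption, the ratio 54.2 / 42.1 comes out at about 1.29, which matches the rounded 29% stated in the abstract.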
