Training Large Neural Networks with Constant Memory using a New Execution Algorithm

Abstract

Widely popular transformer-based NLP models such as BERT and GPT have enormous capacity, trending to billions of parameters. Current execution methods demand brute-force resources such as HBM devices and high-speed interconnectivity for data parallelism. In this paper, we introduce a new relay-style execution technique called L2L (layer-to-layer) where, at any given moment, the device memory is primarily populated only with the footprint of the executing layer(s). The model resides in the DRAM memory attached to either a CPU or an FPGA as an entity we call the eager param-server (EPS). Unlike a traditional param-server, EPS transmits the model piecemeal to the devices, thereby allowing it to perform other tasks in the background such as reduction and distributed optimization. To overcome the bandwidth issues of shuttling parameters to and from EPS, the model is executed a layer at a time across many micro-batches instead of the conventional method of minibatches over the whole model. In this paper, we explore a conservative version of L2L that is implemented on a modest Azure instance for BERT-Large, running it with a batch size of 32 on a single V100 GPU using less than 8GB memory. Our results show a more stable learning curve, faster convergence, better accuracy and 35% reduction in memory compared to the state-of-the-art baseline. Our method reproduces BERT results on any mid-level GPU, which was hitherto not feasible. L2L scales to arbitrary depth without impacting memory or devices, allowing researchers to develop affordable devices. It also enables dynamic approaches such as neural architecture search. This work has been performed on GPUs first but is also targeted towards high TFLOPS/Watt accelerators such as Graphcore IPUs. The code will soon be available on GitHub.
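
The layer-at-a-time schedule described above can be illustrated with a small forward-only sketch. It is a toy under stated assumptions, not the paper's implementation: the class `EagerParamServer`, the functions `forward_l2l` and `forward_conventional`, and the scalar `tanh` layers are all hypothetical names and simplifications introduced here, and the backward pass, reduction, and optimizer steps are omitted. The sketch only shows why running each layer across all micro-batches needs fewer parameter transfers than sending each micro-batch through the whole model when a single layer is resident on the device at a time.

```python
import math


class EagerParamServer:
    """Hypothetical stand-in for the EPS: keeps every layer in host memory."""

    def __init__(self, layers):
        self.layers = layers      # list of (weight, bias) pairs, one per layer
        self.transfers = 0        # counts layer uploads to the device

    def fetch(self, index):
        # Models shipping one layer's parameters to the device.
        self.transfers += 1
        return self.layers[index]


def apply_layer(params, activations):
    # Toy scalar layer (an assumption): y = tanh(w * x + b) per element.
    w, b = params
    return [math.tanh(w * x + b) for x in activations]


def forward_l2l(eps, micro_batches):
    # L2L order: fetch a layer once, run it over all micro-batches,
    # then move on to the next layer.
    acts = [list(mb) for mb in micro_batches]
    for i in range(len(eps.layers)):
        params = eps.fetch(i)
        acts = [apply_layer(params, a) for a in acts]
    return acts


def forward_conventional(eps, micro_batches):
    # Baseline order under the same one-layer-resident constraint:
    # each micro-batch traverses the whole model, refetching every layer.
    outputs = []
    for mb in micro_batches:
        a = list(mb)
        for i in range(len(eps.layers)):
            a = apply_layer(eps.fetch(i), a)
        outputs.append(a)
    return outputs


layers = [(0.5, 0.1), (1.5, -0.2), (-0.7, 0.3)]
micro_batches = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

eps_l2l = EagerParamServer(layers)
eps_conv = EagerParamServer(layers)
out_l2l = forward_l2l(eps_l2l, micro_batches)
out_conv = forward_conventional(eps_conv, micro_batches)

print(out_l2l == out_conv)                     # same results either way
print(eps_l2l.transfers, eps_conv.transfers)   # 3 layer uploads vs 12
```

With 3 layers and 4 micro-batches, the layer-at-a-time order uploads each layer once (3 transfers), while the per-micro-batch order uploads every layer for every micro-batch (12 transfers). The sketch does not model where the intermediate activations are stored, which a real system would also have to account for.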
