Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Abstract

In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-moment estimators requires memory equal to the number of parameters. For the case of neural network weight matrices, we propose maintaining only the per-row and per-column sums of these moving averages, and estimating the per-parameter second moments based on these sums. We demonstrate empirically that this method produces similar results to the baseline. Secondly, we show that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow. We propose update clipping and a gradually increasing decay rate scheme as remedies. Combining these methods and dropping momentum, we achieve comparable results to the published Adam regime in training the Transformer model on the WMT 2014 English-German machine translation task, while using very little auxiliary storage in the optimizer. Finally, we propose scaling the parameter updates based on the scale of the parameters themselves.
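
The factored second-moment idea can be illustrated with a small sketch. The abstract does not state the reconstruction formula, so the code below assumes one natural choice: the outer product of the row sums and column sums, divided by the total sum. The function names `row_col_sums` and `factored_estimate` and the toy matrix `v` are hypothetical and serve only as illustration; this is not presented as the authors' implementation.

```python
def row_col_sums(matrix):
    """Return the per-row and per-column sums of a list-of-lists matrix."""
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    return row_sums, col_sums


def factored_estimate(row_sums, col_sums):
    """Rebuild an n x m estimate from n row sums and m column sums.

    Assumed formula (not given in the abstract):
        v_hat[i][j] = row_sums[i] * col_sums[j] / sum(row_sums)
    """
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


# Toy stand-in for a matrix of squared-gradient moving averages.
# It is rank 1 and non-negative, so the assumed formula recovers it exactly.
v = [[1.0, 2.0, 4.0],
     [2.0, 4.0, 8.0]]

rows, cols = row_col_sums(v)           # 2 + 3 stored values instead of 2 * 3
v_hat = factored_estimate(rows, cols)
```

For an n x m weight matrix, only n + m accumulator values are kept rather than n * m, which is where the sublinear memory cost comes from. For matrices that are not rank 1, the reconstruction is only an approximation.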
