Abstract
In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-moment estimators requires memory equal to the number of parameters. For the case of neural network weight matrices, we propose maintaining only the per-row and per-column sums of these moving averages, and estimating the per-parameter second moments based on these sums. We demonstrate empirically that this method produces similar results to the baseline. Secondly, we show that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow. We propose update clipping and a gradually increasing decay rate scheme as remedies. Combining these methods and dropping momentum, we achieve comparable results to the published Adam regime in training the Transformer model on the WMT 2014 English-German machine translation task, while using very little auxiliary storage in the optimizer. Finally, we propose scaling the parameter updates based on the scale of the parameters themselves.
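
To make the factored estimator and update clipping concrete, the following is a minimal sketch in plain Python, not the reference implementation. It assumes that the per-parameter second moment is reconstructed as the product of the row sum and the column sum divided by the total sum, and that clipping rescales an update whose root-mean-square exceeds a threshold. The function names, the `eps` constant, and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import math

def update_accumulators(row_acc, col_acc, grad, decay, eps=1e-30):
    """Moving averages of per-row and per-column sums of squared gradients.

    Stores n + m numbers for an n x m matrix instead of n * m.
    `eps` is an assumed small constant that keeps the sums positive.
    """
    n, m = len(grad), len(grad[0])
    sq = [[g * g + eps for g in row] for row in grad]
    new_rows = [decay * row_acc[i] + (1 - decay) * sum(sq[i]) for i in range(n)]
    new_cols = [
        decay * col_acc[j] + (1 - decay) * sum(sq[i][j] for i in range(n))
        for j in range(m)
    ]
    return new_rows, new_cols

def factored_second_moment(row_sums, col_sums):
    """Assumed rank-1 estimate: entry (i, j) = row_sums[i] * col_sums[j] / total."""
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

def clip_update(update, threshold=1.0):
    """Scale the update down when its root-mean-square exceeds `threshold`."""
    flat = [u for row in update for u in row]
    rms = math.sqrt(sum(u * u for u in flat) / len(flat))
    scale = max(1.0, rms / threshold)
    return [[u / scale for u in row] for row in update]

# Toy 2 x 3 gradient whose elementwise square has rank 1, so the
# factored estimate reproduces the squared gradient up to rounding.
grad = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
rows, cols = update_accumulators([0.0, 0.0], [0.0, 0.0, 0.0], grad, decay=0.0)
v_hat = factored_second_moment(rows, cols)

# Unscaled update: gradient divided by the square root of the estimate.
update = [
    [g / math.sqrt(v) for g, v in zip(g_row, v_row)]
    for g_row, v_row in zip(grad, v_hat)
]
clipped = clip_update(update, threshold=1.0)
```

In this toy case the optimizer state consists of 2 + 3 = 5 accumulator values rather than 6; for a realistically sized weight matrix the saving is correspondingly larger. Setting `decay=0.0` is only a convenience that makes the accumulators equal to the current squared-gradient sums.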