A unified theory of adaptive stochastic gradient descent as Bayesian filtering

Abstract

There is a diverse array of schemes for adaptive stochastic gradient descent for optimizing neural networks, from fully factorised methods with and without momentum (e.g. RMSProp and ADAM), to Kronecker factored methods that consider the Hessian for a full weight matrix. However, these schemes have been derived and justified using a wide variety of mathematical approaches, and as such, there is no unified theory of adaptive stochastic gradient descent methods. Here, we provide such a theory by showing that many successful adaptive stochastic gradient descent schemes emerge by considering a filtering-based inference in a Bayesian optimization problem. In particular, we use backpropagated gradients to compute a Gaussian posterior over the optimal neural network parameters, given the data minibatches seen so far. Our unified theory is able to give some guidance to practitioners on how to choose between the large number of available optimization methods. In the fully factorised setting, we recover RMSProp and ADAM under different priors, along with additional improvements such as Nesterov acceleration and AdamW. Moreover, we obtain new recommendations, including the possibility of combining RMSProp and ADAM updates. In the Kronecker factored setting, we obtain an adaptive natural gradient adaptation scheme that is derived specifically for the minibatch setting. Furthermore, under a modified prior, we obtain a Kronecker factored analogue of RMSProp or ADAM that preconditions the gradient by whitening (i.e. by multiplying by the square root of the Hessian, as in RMSProp/ADAM). Our work raises the hope that it is possible to achieve a unified theoretical understanding of empirically successful adaptive gradient descent schemes for neural networks.
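For readers unfamiliar with the two fully factorised baselines named in the abstract, the following is a minimal sketch of the textbook RMSProp and Adam updates for a single scalar parameter. It is not the Bayesian filtering derivation described in the paper; the function names, the hyperparameter defaults, and the toy objective f(w) = 0.5 * w**2 are all assumptions chosen for illustration.

```python
import math


def rmsprop_step(w, g, v, lr=0.01, beta=0.9, eps=1e-8):
    """One textbook RMSProp update for a single scalar parameter."""
    # Exponential moving average of the squared gradient.
    v = beta * v + (1.0 - beta) * g * g
    # Scale the step by the root-mean-square gradient.
    w = w - lr * g / (math.sqrt(v) + eps)
    return w, v


def adam_step(w, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update (with bias correction); t counts from 1."""
    # Moving averages of the gradient (momentum) and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Correct the bias caused by initialising both averages at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


def minimize(method, w0=5.0, steps=2000):
    """Minimise the toy objective f(w) = 0.5 * w**2, whose gradient is w."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = w  # exact gradient of the toy objective (no minibatch noise)
        if method == "rmsprop":
            w, v = rmsprop_step(w, g, v)
        else:
            w, m, v = adam_step(w, g, m, v, t)
    return w


if __name__ == "__main__":
    print("rmsprop:", minimize("rmsprop"))
    print("adam:   ", minimize("adam"))
```

Both updates divide the gradient by a running root-mean-square estimate; Adam additionally applies momentum to the gradient itself. This is the difference the abstract refers to when it distinguishes fully factorised methods with and without momentum.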
