AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization

Abstract

Adaptive gradient methods such as AdaGrad and its variants update thestepsize in stochastic gradient descent on the fly according to the gradientsreceived along the way; such methods have gained widespread use in large-scaleoptimization for their ability to converge robustly, without the need to finetune parameters such as the stepsize schedule. Yet, the theoretical guaranteesto date for AdaGrad are for online and convex optimization, which is quitedifferent from the offline and nonconvex setting where adaptive gradientmethods shine in practice. We bridge this gap by providing strong theoreticalguarantees in batch and stochastic setting, for the convergence of AdaGrad oversmooth, nonconvex landscapes, from any initialization of the stepsize, withoutknowledge of Lipschitz constant of the gradient. We show in the stochasticsetting that AdaGrad converges to a stationary point at the optimal$O(1/\sqrt{N})$ rate (up to a $\log(N)$ factor), and in the batch setting, atthe optimal $O(1/N)$ rate. Moreover, in both settings, the constant in the ratematches the constant obtained as if the variance of the gradient noise andLipschitz constant of the gradient were known in advance and used to tune thestepsize, up to a logarithmic factor of the mismatch between the optimalstepsize and the stepsize used to initialize AdaGrad. In particular, ourresults imply that AdaGrad is robust to both the unknown Lipschitz constant andlevel of stochastic noise on the gradient, in a near-optimal sense. When thereis noise, AdaGrad converges at the rate of $O(1/\sqrt{N})$ with well-tunedstepsize, and when there is not noise, the same algorithm converges at the rateof $O(1/N)$ like well-tuned batch gradient descent.

Quick Read (beta)

loading the full paper ...