Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

Abstract

Nesterov's accelerated gradient descent (AGD), an instance of the generalfamily of "momentum methods", provably achieves faster convergence rate thangradient descent (GD) in the convex setting. However, whether these methods aresuperior to GD in the nonconvex setting remains open. This paper studies asimple variant of AGD, and shows that it escapes saddle points and finds asecond-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations,faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To thebest of our knowledge, this is the first Hessian-free algorithm to find asecond-order stationary point faster than GD, and also the first single-loopalgorithm with a faster rate than GD even in the setting of finding afirst-order stationary point. Our analysis is based on two key ideas: (1) theuse of a simple Hamiltonian function, inspired by a continuous-timeperspective, which AGD monotonically decreases per step even for nonconvexfunctions, and (2) a novel framework called improve or localize, which isuseful for tracking the long-term behavior of gradient-based optimizationalgorithms. We believe that these techniques may deepen our understanding ofboth acceleration algorithms and nonconvex optimization.

Quick Read (beta)

loading the full paper ...