Abstract
Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(\theta)$, is bounded by $2/\eta$, training is "stable" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full batch or large batch gradient descent. Most recently, Cohen et al. (2021) observed two important phenomena. The first, dubbed progressive sharpening, is that the sharpness steadily increases throughout training until it reaches the instability cutoff $2/\eta$. The second, dubbed edge of stability, is that the sharpness hovers at $2/\eta$ for the remainder of training while the loss continues decreasing, albeit non-monotonically.

We demonstrate that, far from being chaotic, the dynamics of gradient descent at the edge of stability can be captured by a cubic Taylor expansion: as the iterates diverge in direction of the top eigenvector of the Hessian due to instability, the cubic term in the local Taylor expansion of the loss function causes the curvature to decrease until stability is restored. This property, which we call self-stabilization, is a general property of gradient descent and explains its behavior at the edge of stability. A key consequence of self-stabilization is that gradient descent at the edge of stability implicitly follows projected gradient descent (PGD) under the constraint $S(\theta) \le 2/\eta$. Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions. Our analysis uncovers the mechanism for gradient descent's implicit bias towards stability.
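
The classical stability cutoff $2/\eta$ can be illustrated with a minimal sketch on a one-dimensional quadratic, $L(\theta) = \tfrac{1}{2} S \theta^2$, whose sharpness is the constant $S$. This toy loss, the function name `gd_on_quadratic`, and the chosen values of the learning rate and sharpness are assumptions made for illustration and do not come from the analysis above. Because the sharpness of a quadratic never changes, the sketch only shows the cutoff itself; it cannot exhibit progressive sharpening or self-stabilization, which depend on the cubic term.

```python
def gd_on_quadratic(sharpness, lr, theta0=1.0, steps=50):
    """Run gradient descent on the toy loss L(theta) = 0.5 * sharpness * theta**2.

    Returns the loss recorded before each of the `steps` updates.
    """
    theta = theta0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * sharpness * theta ** 2)
        grad = sharpness * theta      # dL/dtheta for the quadratic
        theta = theta - lr * grad     # equivalent to theta *= (1 - lr * sharpness)
    return losses


lr = 0.1  # illustrative learning rate; the cutoff is 2 / lr = 20

# Sharpness just below the cutoff: |1 - lr * S| < 1, so the loss shrinks every step.
stable = gd_on_quadratic(sharpness=19.0, lr=lr)

# Sharpness just above the cutoff: |1 - lr * S| > 1, so the loss grows every step.
unstable = gd_on_quadratic(sharpness=21.0, lr=lr)

print(f"S=19: first loss {stable[0]:.3g}, last loss {stable[-1]:.3g}")
print(f"S=21: first loss {unstable[0]:.3g}, last loss {unstable[-1]:.3g}")
```

In this sketch each update multiplies $\theta$ by $1 - \eta S$, so the iterates contract exactly when $|1 - \eta S| < 1$, that is, when $0 < S < 2/\eta$.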