Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability

Abstract

We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2 / \text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability. Code is available at https://github.com/locuslab/edge-of-stability.
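
To see why the value $2 / \text{(step size)}$ is a natural threshold, it may help to look at the simplest possible case. The sketch below is not taken from the paper's code; it assumes a one-dimensional quadratic $f(x) = \tfrac{1}{2} a x^2$, whose Hessian is the constant $a$, and the names `run_gd`, `sharpness`, and `step_size` are illustrative. On this objective each gradient descent step multiplies $x$ by $(1 - \text{step size} \cdot a)$, so the iterates shrink when $a < 2 / \text{(step size)}$ and grow when $a > 2 / \text{(step size)}$.

```python
def run_gd(sharpness, step_size, x0=1.0, num_steps=50):
    """Gradient descent on f(x) = 0.5 * sharpness * x**2 (illustrative toy objective).

    Returns the list of iterates, starting with x0.
    """
    xs = [x0]
    x = x0
    for _ in range(num_steps):
        grad = sharpness * x          # f'(x) = sharpness * x
        x = x - step_size * grad      # x <- (1 - step_size * sharpness) * x
        xs.append(x)
    return xs


step_size = 0.1
threshold = 2.0 / step_size           # curvature above which GD on a quadratic diverges

# Curvature slightly below and slightly above the threshold (assumed values).
below = run_gd(sharpness=threshold - 1.0, step_size=step_size)
above = run_gd(sharpness=threshold + 1.0, step_size=step_size)

print("threshold 2/step_size:", threshold)
print("below threshold, |x| shrinks:", abs(below[-1]) < abs(below[0]))
print("above threshold, |x| grows:  ", abs(above[-1]) > abs(above[0]))
```

On a quadratic, curvature above the threshold simply means divergence. The regime described in the abstract differs: on neural network objectives the maximum Hessian eigenvalue stays just above this value while the loss still decreases over long timescales, which this toy example does not reproduce.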
