DNN's Sharpest Directions Along the SGD Trajectory

Abstract

Recent work has identified that using a high learning rate or a small batch size for Stochastic Gradient Descent (SGD) based training of deep neural networks encourages finding flatter minima of the training loss towards the end of training. Moreover, measures of the flatness of minima have been shown to correlate with good generalization performance. Extending this previous work, we investigate the loss curvature through the Hessian eigenvalue spectrum in the early phase of training and find an analogous bias: even at the beginning of training, a high learning rate or small batch size influences SGD to visit flatter loss regions. In addition, the evolution of the largest eigenvalues appears to always follow a similar pattern, with a fast increase in the early phase, and a decrease or stabilization thereafter, where the peak value is determined by the learning rate and batch size. Finally, we find that by altering the learning rate just in the direction of the eigenvectors associated with the largest eigenvalues, SGD can be steered towards regions which are an order of magnitude sharper but correspond to models with similar generalization, which suggests the curvature of the endpoint found by SGD is not predictive of its generalization properties.
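
To make the two quantities in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It estimates the largest Hessian eigenvalue and its eigenvector by power iteration on finite-difference Hessian-vector products, then takes one gradient step whose learning rate is rescaled only along that eigenvector. The function names, the toy quadratic loss, and the scaling factor `gamma` are all assumptions introduced for illustration. Power iteration returns the eigenvalue of largest magnitude, which coincides with the largest eigenvalue only when the top of the spectrum is positive.

```python
import math
import random


def toy_grad(w):
    # Hypothetical loss L(w) = 0.5 * (10 * w0^2 + w1^2); its Hessian is diag(10, 1).
    return [10.0 * w[0], 1.0 * w[1]]


def hvp(grad_fn, w, v, eps=1e-5):
    # Central finite-difference Hessian-vector product:
    # H v ~= (g(w + eps * v) - g(w - eps * v)) / (2 * eps)
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad_fn(w_plus), grad_fn(w_minus)
    return [(a - b) / (2.0 * eps) for a, b in zip(g_plus, g_minus)]


def top_eigenpair(grad_fn, w, iters=200, seed=0):
    # Power iteration: repeatedly apply H to a unit vector and renormalize.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in w]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        lam = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            break
        v = [x / norm for x in hv]
    return lam, v


def steered_step(w, g, v, lr, gamma):
    # Use learning rate lr * gamma along the unit vector v and plain lr
    # in the orthogonal complement of v.
    proj = sum(a * b for a, b in zip(g, v))  # gradient component along v
    return [
        wi - lr * (gi - proj * vi) - lr * gamma * proj * vi
        for wi, gi, vi in zip(w, g, v)
    ]


w = [1.0, 1.0]
lam, v = top_eigenpair(toy_grad, w)
w_next = steered_step(w, toy_grad(w), v, lr=0.1, gamma=0.5)
```

With `gamma < 1` the step is damped along the sharpest direction, and with `gamma > 1` it is amplified. For a real network the finite-difference product would normally be replaced by an automatic-differentiation Hessian-vector product, and several leading eigenvectors could be handled by projecting onto each in turn.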
