Low-Pass Filtering SGD for Recovering Flat Optima in the Deep Learning Optimization Landscape

Abstract

In this paper, we study the sharpness of a deep learning (DL) loss landscape around local minima in order to reveal systematic mechanisms underlying the generalization abilities of DL models. Our analysis is performed across varying network and optimizer hyper-parameters, and involves a rich family of different sharpness measures. We compare these measures and show that the low-pass filter-based measure exhibits the highest correlation with the generalization abilities of DL models, is highly robust to both data and label noise, and furthermore can track the double descent behavior of neural networks. We next derive an optimization algorithm, based on the low-pass filter (LPF), that actively searches for flat regions in the DL optimization landscape using an SGD-like procedure. The update of the proposed algorithm, which we call LPF-SGD, is determined by the gradient of the convolution of the filter kernel with the loss function and can be efficiently computed using Monte Carlo (MC) sampling. We empirically show that our algorithm achieves superior generalization performance compared to common DL training strategies. On the theoretical front, we prove that LPF-SGD converges to a better optimal point with smaller generalization error than SGD.
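To make the LPF-SGD update concrete, the following is a minimal sketch rather than the authors' implementation. It assumes an isotropic Gaussian filter kernel with standard deviation `sigma`, which the abstract does not specify. Under that assumption the gradient of the convolved loss equals the expected gradient of the loss at Gaussian-perturbed parameters, so it can be estimated by averaging over a few samples. The names `lpf_gradient`, `lpf_sgd`, `grad_fn`, `sigma`, and `num_samples` are hypothetical and chosen for illustration only.

```python
import random


def lpf_gradient(grad_fn, theta, sigma=0.1, num_samples=8, rng=random):
    """Monte Carlo estimate of the gradient of the smoothed loss (K * L)(theta).

    Assumption: K is an isotropic Gaussian N(0, sigma^2 I), so the smoothed
    gradient is the average of grad L(theta + eps) over eps drawn from K.
    """
    total = [0.0] * len(theta)
    for _ in range(num_samples):
        # Perturb every coordinate with independent Gaussian noise.
        perturbed = [t + rng.gauss(0.0, sigma) for t in theta]
        grad = grad_fn(perturbed)
        total = [acc + g for acc, g in zip(total, grad)]
    return [acc / num_samples for acc in total]


def lpf_sgd(grad_fn, theta0, lr=0.1, steps=200, sigma=0.1, num_samples=8, seed=0):
    """Plain gradient descent driven by the smoothed-gradient estimate."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    theta = list(theta0)
    for _ in range(steps):
        grad = lpf_gradient(grad_fn, theta, sigma, num_samples, rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


# Toy usage: L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta_star = lpf_sgd(lambda th: list(th), [1.0, -2.0], sigma=0.01)
```

In a real training loop `grad_fn` would be a mini-batch gradient, and the choice of `sigma` and `num_samples` trades smoothing strength against the cost of extra gradient evaluations per step.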
