Beyond the Quadratic Approximation: the Multiscale Structure of Neural Network Loss Landscapes

Abstract

The quadratic approximation of neural network loss landscapes has been used extensively to study the optimization of these networks. However, it usually holds only in a very small neighborhood of the minimum, and it cannot explain many phenomena observed during optimization. In this work, we study the structure of neural network loss functions and its implications for optimization in a region beyond the reach of a good quadratic approximation. Numerically, we observe that neural network loss functions possess a multiscale structure, manifested in two ways: (1) in a neighborhood of minima, the loss mixes a continuum of scales and grows subquadratically, and (2) in a larger region, the loss clearly shows several separate scales. Using the subquadratic growth, we explain the Edge of Stability phenomenon [5] observed for the gradient descent (GD) method. Using the separate scales, we explain the working mechanism of learning rate decay through simple examples. Finally, we study the origin of the multiscale structure and propose that the non-convexity of the models and the non-uniformity of training data are among its causes. By constructing a two-layer neural network problem, we show that training data with different magnitudes give rise to different scales of the loss function, producing subquadratic growth and multiple separate scales.
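As a rough illustration of why subquadratic growth matters for GD, the sketch below runs gradient descent on a one-dimensional toy loss f(x) = |x|^p with 1 < p < 2. This toy loss, the exponent p = 1.5, the learning rates, and the function names are all assumptions chosen for illustration; they are not the networks or experiments of the paper. Because the curvature of this loss grows without bound near the minimum, GD cannot settle at x = 0. A short calculation suggests that it instead approaches a period-two oscillation at amplitude (lr * p / 2)^(1 / (2 - p)), where the curvature equals 2(p - 1) / lr, so the curvature at the iterates scales inversely with the learning rate.

```python
def grad(x, p=1.5):
    """Gradient of the toy loss f(x) = |x|**p, with 1 < p < 2 (subquadratic)."""
    sign = 1.0 if x > 0 else -1.0 if x < 0 else 0.0
    return p * sign * abs(x) ** (p - 1)


def curvature(x, p=1.5):
    """Second derivative of f(x) = |x|**p for x != 0; it blows up as x -> 0."""
    return p * (p - 1) * abs(x) ** (p - 2)


def run_gd(lr, x0=1.0, steps=200, p=1.5):
    """Run plain gradient descent on the toy loss and return the final iterate."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x, p)
    return x


# Hypothetical learning rates, chosen only for illustration.
for lr in (0.1, 0.05, 0.025):
    x = run_gd(lr)
    # Report the oscillation amplitude and the product curvature * lr.
    print(f"lr={lr}: |x|={abs(x):.3e}, curvature*lr={curvature(x) * lr:.6f}")
```

In this toy setting the product of curvature and learning rate settles at the same constant for every learning rate, which is the qualitative behavior a purely quadratic loss cannot produce, since its curvature is fixed.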
