Plateau Phenomenon in Gradient Descent Training of ReLU networks: Explanation, Quantification and Avoidance

Abstract

The ability of neural networks to provide 'best in class' approximation across a wide range of applications is well-documented. Nevertheless, the powerful expressivity of neural networks comes to naught if one is unable to effectively train (choose) the parameters defining the network. In general, neural networks are trained by gradient descent type optimization methods, or a stochastic variant thereof. In practice, with such methods the loss function decreases rapidly at the beginning of training but then, after a relatively small number of steps, slows down significantly. The loss may even appear to stagnate over a large number of epochs, only to then suddenly start to decrease fast again for no apparent reason. This so-called plateau phenomenon manifests itself in many learning tasks. The present work aims to identify and quantify the root causes of the plateau phenomenon. No assumptions are made on the number of neurons relative to the number of training data, and our results hold for both the lazy and adaptive regimes. The main findings are: plateaux correspond to periods during which activation patterns remain constant, where activation pattern refers to the number of data points that activate a given neuron; quantification of convergence of the gradient flow dynamics; and characterization of stationary points in terms of solutions of local least squares regression lines over subsets of the training data. Based on these conclusions, we propose a new iterative training method, the Active Neuron Least Squares (ANLS), characterised by the explicit adjustment of the activation pattern at each step, which is designed to enable a quick exit from a plateau. Illustrative numerical examples are included throughout.
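
To make the notion of an activation pattern concrete, the following is a minimal sketch that counts, for each hidden neuron, how many data points activate it. It assumes a one-hidden-layer ReLU network in which a point x activates neuron j when w_j · x + b_j > 0; the abstract does not fix the architecture, and the function name `activation_pattern`, the array names, and the toy data are hypothetical choices made only for illustration.

```python
import numpy as np


def activation_pattern(X, W, b):
    """Count, for each hidden neuron, how many data points activate it.

    Assumed setup (not taken from the paper): a one-hidden-layer ReLU
    network with weights W of shape (n_neurons, d) and biases b of shape
    (n_neurons,). X holds the training inputs, shape (n_samples, d).
    """
    pre = X @ W.T + b              # pre-activations, shape (n_samples, n_neurons)
    return (pre > 0).sum(axis=0)   # per-neuron counts, shape (n_neurons,)


# Toy 1-D data set with five points and two neurons (illustrative values).
X = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])
W = np.array([[1.0], [-1.0]])
b = np.array([0.5, -1.5])

counts = activation_pattern(X, W, b)
print(counts)  # neuron 0 is active on x in {0, 1, 2}; neuron 1 only on x = -2
```

Under this reading, one could record `counts` after every gradient step: by the first finding stated above, a plateau would show up as a stretch of iterations over which these counts do not change.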
