Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks

Abstract

Understanding the inductive bias and generalization properties of large overparametrized machine learning models requires characterizing the dynamics of the training algorithm. We study the learning dynamics of large two-layer neural networks via dynamical mean field theory, a well-established technique of non-equilibrium statistical physics. We show that, for large network width $m$ and large number of samples per input dimension $n/d$, the training dynamics exhibits a separation of timescales, which implies: $(i)$~the emergence of a slow timescale associated with the growth in Gaussian/Rademacher complexity of the network; $(ii)$~an inductive bias towards small complexity if the initialization has small enough complexity; $(iii)$~a dynamical decoupling between feature learning and overfitting regimes; $(iv)$~a non-monotone behavior of the test error, associated with a `feature unlearning' regime at large times.
