Optimal generalisation and learning transition in extensive-width shallow neural networks near interpolation

Abstract

We consider a teacher-student model of supervised learning with a fully-trained two-layer neural network whose width $k$ and input dimension $d$ are large and proportional. We provide an effective theory for approximating the Bayes-optimal generalisation error of the network for any activation function in the regime of sample size $n$ scaling quadratically with the input dimension, i.e., around the interpolation threshold where the number of trainable parameters $kd+k$ and of data $n$ are comparable. Our analysis tackles generic weight distributions. We uncover a discontinuous phase transition separating a "universal" phase from a "specialisation" phase. In the first, the generalisation error is independent of the weight distribution and decays slowly with the sampling rate $n/d^2$, with the student learning only some non-linear combinations of the teacher weights. In the latter, the error is weight distribution-dependent and decays faster due to the alignment of the student towards the teacher network. We thus unveil the existence of a highly predictive solution near interpolation, which is however potentially hard to find by practical algorithms.
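
As a brief illustration of why quadratic sample scaling corresponds to the interpolation threshold, write $k = \gamma d$ and $n = \alpha d^2$. The constants $\gamma$ and $\alpha$ are notation assumed here for illustration only; the abstract itself does not fix them. Under that assumption, the ratio of data to trainable parameters stays of order one as $d$ grows:

$$
\frac{n}{kd+k} \;=\; \frac{\alpha d^2}{\gamma d^2 + \gamma d} \;=\; \frac{\alpha}{\gamma}\cdot\frac{1}{1 + 1/d} \;\xrightarrow[d\to\infty]{}\; \frac{\alpha}{\gamma}.
$$

In this notation the sampling rate $n/d^2$ equals $\alpha$, which is the quantity the generalisation error is described as decaying with in both phases.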
