Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression: A Distribution-Free Analysis

Abstract

We study nonparametric regression by an over-parameterized two-layer neural network trained by gradient descent (GD) in this paper. We show that, if the neural network is trained by GD with early stopping, then the trained network achieves a sharp rate of the nonparametric regression risk of $\mathcal{O}(\varepsilon_n^2)$, which is the same rate as that for classical kernel regression trained by GD with early stopping, where $\varepsilon_n$ is the critical population rate of the Neural Tangent Kernel (NTK) associated with the network and $n$ is the size of the training data. Notably, our result does not require distributional assumptions on the training data, in strong contrast with many existing results which rely on specific distributions such as the spherical uniform data distribution or distributions satisfying certain restrictive conditions. The rate $\mathcal{O}(\varepsilon_n^2)$ is known to be minimax optimal for specific cases, such as the case that the NTK has a polynomial eigenvalue decay rate, which happens under certain distributional assumptions. Our result formally fills the gap between training a classical kernel regression model and training an over-parameterized but finite-width neural network by GD for nonparametric regression without distributional assumptions. We also provide affirmative answers to certain open questions and address particular concerns in the literature on training over-parameterized neural networks by GD with early stopping for nonparametric regression, including the characterization of the stopping time, the lower bound for the network width, and the constant learning rate used in GD.
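
For orientation, the training setup described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the procedure analyzed in the paper: the ReLU activation, the fixed random-sign second layer, the $1/\sqrt{m}$ output scaling, and the function names `predict` and `train_two_layer_gd` are all hypothetical choices common in the NTK literature. In particular, the stopping rule here is a hold-out validation heuristic with a `patience` parameter, swapped in as a stand-in for the theoretically characterized stopping time, which the abstract does not spell out.

```python
import numpy as np


def predict(W, a, Z):
    """Two-layer ReLU network output with 1/sqrt(width) scaling (assumed form)."""
    return np.maximum(Z @ W.T, 0.0) @ a / np.sqrt(W.shape[0])


def train_two_layer_gd(X, y, X_val, y_val, width=512, lr=0.5,
                       max_steps=500, patience=20, seed=0):
    """Full-batch GD with a constant learning rate on the first-layer weights.

    Early stopping uses a hold-out set (an assumption made for this sketch):
    training halts once the validation loss has not improved for `patience`
    consecutive evaluations, and the best weights seen so far are returned.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((width, d))      # trainable first-layer weights
    a = rng.choice([-1.0, 1.0], size=width)  # second layer, fixed at random signs

    best_W, best_val, best_step, since_best = W.copy(), np.inf, 0, 0
    for step in range(max_steps + 1):
        # Evaluate the current iterate on the hold-out set.
        val = float(np.mean((predict(W, a, X_val) - y_val) ** 2))
        if val < best_val:
            best_W, best_val, best_step, since_best = W.copy(), val, step, 0
        else:
            since_best += 1
            if since_best >= patience:
                break

        # Gradient of the loss (1 / 2n) * sum_i (f(x_i) - y_i)^2 with respect to W.
        pre = X @ W.T                                    # (n, width) pre-activations
        resid = np.maximum(pre, 0.0) @ a / np.sqrt(width) - y
        grad = ((resid[:, None] * (pre > 0)) * a).T @ X / (n * np.sqrt(width))
        W = W - lr * grad                                # constant learning rate

    return best_W, a, best_step, best_val
```

Calling `train_two_layer_gd(X, y, X_val, y_val)` on unit-normalized inputs returns the weights at the selected step together with that step index, which plays the role of the stopping time in this sketch.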
