Abstract
We study nonparametric regression with an over-parameterized two-layer neural network trained by gradient descent (GD). We show that, if the neural network is trained by GD with early stopping, then the trained network achieves a sharp nonparametric regression risk rate of $\cO(\eps_n^2)$, the same rate as that of classical kernel regression trained by GD with early stopping, where $\eps_n$ is the critical population rate of the Neural Tangent Kernel (NTK) associated with the network and $n$ is the size of the training data. Notably, our result does not require distributional assumptions on the training data, in strong contrast with many existing results, which rely on specific distributions such as the spherical uniform data distribution or on distributions satisfying certain restrictive conditions. The rate $\cO(\eps_n^2)$ is known to be minimax optimal in specific cases, such as when the NTK has a polynomial eigenvalue decay rate, which happens under certain distributional assumptions. Our result formally fills the gap between training a classical kernel regression model and training an over-parameterized but finite-width neural network by GD for nonparametric regression without distributional assumptions. We also provide affirmative answers to certain open questions or address particular concerns in the literature on training over-parameterized neural networks by GD with early stopping for nonparametric regression, including the characterization of the stopping time, the lower bound for the network width, and the constant learning rate used in GD.
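
To make the rate $\cO(\eps_n^2)$ concrete, the following sketch recalls one standard way the critical rate is defined in the kernel regression literature and works out the polynomial-decay case. It is an illustration under stated assumptions, not necessarily the exact normalization used in the paper: $\lambda_1 \ge \lambda_2 \ge \dots$ are assumed to be the eigenvalues of the integral operator associated with the NTK, $\sigma$ is an assumed noise level, and $\alpha > 1/2$ is an assumed decay exponent. None of these symbols appear in the abstract itself.

$$
\eps_n \;=\; \text{the smallest } \eps > 0 \text{ such that } \sqrt{\frac{1}{n}\sum_{j \ge 1} \min\{\lambda_j, \eps^2\}} \;\le\; \frac{\eps^2}{\sigma}.
$$

If $\lambda_j \asymp j^{-2\alpha}$, then $\sum_{j \ge 1} \min\{\lambda_j, \eps^2\} \asymp \eps^{2 - 1/\alpha}$, and balancing the two sides of the inequality (treating $\sigma$ as a constant) gives

$$
\eps_n^2 \;\asymp\; n^{-\frac{2\alpha}{2\alpha + 1}},
$$

which is the familiar minimax rate for nonparametric regression under polynomial eigenvalue decay.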