Feature selection with gradient descent on two-layer networks in low-rotation regimes

Abstract

This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint), and making use of margins as the core analytic technique. The first regime is near initialization, specifically until the weights have moved by $\mathcal{O}(\sqrt m)$, where $m$ denotes the network width, which is in sharp contrast to the $\mathcal{O}(1)$ weight motion allowed by the Neural Tangent Kernel (NTK); here it is shown that GF and SGD only need a network width and number of samples inversely proportional to the NTK margin, and moreover that GF attains at least the NTK margin itself, which suffices to establish escape from bad KKT points of the margin objective, whereas prior work could only establish nondecreasing but arbitrarily small margins. The second regime is the Neural Collapse (NC) setting, where data lies in extremely well-separated groups, and the sample complexity scales with the number of groups; here the contribution over prior work is an analysis of the entire GF trajectory from initialization. Lastly, if the inner layer weights are constrained to change in norm only and cannot rotate, then GF with large widths achieves globally maximal margins, and its sample complexity scales with their inverse; this is in contrast to prior work, which required infinite width and a tricky dual convergence assumption. As purely technical contributions, this work develops a variety of potential functions and other tools which will hopefully aid future work.
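
To make the quantities in the abstract concrete, the following is a minimal NumPy sketch of a two-layer ReLU network, an unnormalized margin, and a per-neuron rotation measure. It is illustrative only: the function names, the Gaussian initialization scaling, and the choice of an unnormalized margin are assumptions made for this example, not definitions taken from the paper.

```python
import numpy as np

def init_two_layer(m, d, rng):
    """Gaussian inner weights W (m x d) and outer weights a (m,).

    The 1/sqrt(d) and 1/sqrt(m) scalings are assumptions for this sketch.
    """
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    a = rng.standard_normal(m) / np.sqrt(m)
    return W, a

def predict(W, a, X):
    """f(x) = sum_j a_j * relu(<w_j, x>) for each row x of X."""
    return np.maximum(X @ W.T, 0.0) @ a

def margin(W, a, X, y):
    """Smallest signed prediction min_i y_i * f(x_i) (unnormalized).

    It is positive exactly when every example is classified correctly.
    """
    return np.min(y * predict(W, a, X))

def rotation_cosines(W0, W1):
    """Cosine between each neuron's initial and current inner weight vector.

    Values near 1 mean that neuron has rotated little.
    """
    num = np.sum(W0 * W1, axis=1)
    den = np.linalg.norm(W0, axis=1) * np.linalg.norm(W1, axis=1)
    return num / den
```

In this sketch, rescaling each row of `W` by a positive factor leaves `rotation_cosines` at 1, which corresponds to the third regime, where the inner weights may change in norm but not in direction.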
