Convergence and Implicit Bias of Gradient Flow on Overparametrized Linear Networks

Abstract

Neural networks trained via gradient descent with random initialization and without any regularization enjoy good generalization performance in practice despite being highly overparametrized. A promising direction to explain this phenomenon is to study how initialization and overparametrization affect convergence and implicit bias of training algorithms. In this paper, we present a novel analysis of single-hidden-layer linear networks trained under gradient flow, which connects initialization, optimization, and overparametrization. Firstly, we show that the squared loss converges exponentially to its optimum at a rate that depends on the level of imbalance and the margin of the initialization. Secondly, we show that proper initialization constrains the dynamics of the network parameters to lie within an invariant set. In turn, minimizing the loss over this set leads to the min-norm solution. Finally, we show that large hidden layer width, together with (properly scaled) random initialization, ensures proximity to such an invariant set during training, allowing us to derive a novel non-asymptotic upper bound on the distance between the trained network and the min-norm solution.
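The invariant-set claim can be illustrated numerically. The following is a minimal sketch, not the paper's method: it assumes the network is parametrized as x -> V U^T x with squared loss 0.5 * ||Y - X U V^T||_F^2, approximates gradient flow by forward-Euler steps with a small step size, and takes "proper initialization" to mean that the columns of U start in the row space of X. The function name, dimensions, seed, initialization scale, and step size are all hypothetical choices made for illustration.

```python
import numpy as np

def train_linear_network(X, Y, U0, V0, step=1e-3, n_steps=50_000):
    """Forward-Euler approximation of gradient flow on
    L(U, V) = 0.5 * ||Y - X U V^T||_F^2 (assumed parametrization)."""
    U, V = U0.copy(), V0.copy()
    for _ in range(n_steps):
        XU = X @ U                    # shape (n, h)
        E = XU @ V.T - Y              # residual, shape (n, m)
        grad_U = X.T @ E @ V          # dL/dU, shape (D, h)
        grad_V = E.T @ XU             # dL/dV, shape (m, h)
        U, V = U - step * grad_U, V - step * grad_V
    return U, V

rng = np.random.default_rng(0)
n, D, m, h = 5, 20, 1, 50             # fewer samples than input features (n < D)
X = rng.standard_normal((n, D))
Y = rng.standard_normal((n, m))

X_pinv = np.linalg.pinv(X)
theta_min_norm = X_pinv @ Y           # min-norm interpolating solution, shape (D, m)
P_row = X_pinv @ X                    # orthogonal projector onto the row space of X

# Random initialization scaled by 1/sqrt(h); the factor 0.5 is an arbitrary choice.
U0 = 0.5 * rng.standard_normal((D, h)) / np.sqrt(h)
V0 = 0.5 * rng.standard_normal((m, h)) / np.sqrt(h)

# Run 1: U starts inside the row space of X (assumed invariant-set initialization).
U_inv, V_inv = train_linear_network(X, Y, P_row @ U0, V0)
# Run 2: generic initialization with a component outside the row space.
U_gen, V_gen = train_linear_network(X, Y, U0, V0)

loss_inv = 0.5 * np.linalg.norm(X @ U_inv @ V_inv.T - Y) ** 2
dist_inv = np.linalg.norm(U_inv @ V_inv.T - theta_min_norm)
dist_gen = np.linalg.norm(U_gen @ V_gen.T - theta_min_norm)

print(f"final loss (row-space init):          {loss_inv:.3e}")
print(f"distance to min-norm (row-space init): {dist_inv:.3e}")
print(f"distance to min-norm (generic init):   {dist_gen:.3e}")
```

Under these assumptions, every update to U lies in the row space of X, so the end-to-end matrix U V^T of the first run never leaves that subspace, and driving the loss to zero there yields the min-norm solution. The second run keeps whatever component of U0 lies outside the row space, which is why its distance to the min-norm solution is expected to stay larger.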
