Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

  • 2020-01-16 18:48:34
  • Wei Hu, Lechao Xiao, Jeffrey Pennington

Abstract

The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. We show that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian initializations scales linearly in the depth. Our results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.
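
To make the contrast between the two initialization schemes concrete, the following is a minimal NumPy sketch, not the paper's experimental setup. It assumes square layers of equal width and a 1/width variance scaling for the Gaussian case; the helper names (`orthogonal_init`, `gaussian_init`, `end_to_end`, `condition_number`) and the chosen width and depth are illustrative assumptions. It compares the conditioning of the end-to-end matrix of a deep linear network at initialization: a product of orthogonal matrices is itself orthogonal, so all of its singular values equal 1, whereas a product of iid Gaussian matrices typically becomes badly conditioned as depth grows.

```python
import numpy as np


def orthogonal_init(n, rng):
    """Sample an n x n orthogonal matrix via QR of a Gaussian matrix."""
    a = rng.standard_normal((n, n))
    q, r = np.linalg.qr(a)
    # Flip column signs using the diagonal of R so the sample is
    # uniformly (Haar) distributed over the orthogonal group.
    d = np.sign(np.diag(r))
    d[d == 0] = 1.0
    return q * d


def gaussian_init(n, rng):
    """Sample an n x n matrix with iid N(0, 1/n) entries (assumed scaling)."""
    return rng.standard_normal((n, n)) / np.sqrt(n)


def end_to_end(init_fn, depth, width, rng):
    """Multiply `depth` freshly initialized layers into one matrix."""
    prod = np.eye(width)
    for _ in range(depth):
        prod = init_fn(width, rng) @ prod
    return prod


def condition_number(m):
    """Ratio of the largest to the smallest singular value."""
    s = np.linalg.svd(m, compute_uv=False)
    return s[0] / s[-1]


rng = np.random.default_rng(0)
width, depth = 32, 20  # illustrative values, not taken from the paper

cond_orth = condition_number(end_to_end(orthogonal_init, depth, width, rng))
cond_gauss = condition_number(end_to_end(gaussian_init, depth, width, rng))

print(f"orthogonal: condition number = {cond_orth:.6f}")
print(f"gaussian:   condition number = {cond_gauss:.3e}")
```

The sketch only inspects the network at initialization. The abstract's claim is stronger: the advantage of the orthogonal start persists throughout gradient descent, which this snippet does not attempt to demonstrate.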

 
