Theory of Deep Learning III: explaining the non-overfitting puzzle

  • 2017-12-30 18:27:35
  • Tomaso Poggio, Kenji Kawaguchi, Qianli Liao, Brando Miranda, Lorenzo Rosasco, Xavier Boix, Jack Hidary, Hrushikesh Mhaskar
  • 64

Abstract

A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave, near zero stable minima of the empirical error, as a gradient system in a quadratic potential with degenerate Hessian. The proposition is supported by theoretical and numerical results, under the assumption of stable minima of the gradient. Our proposition extends to deep networks key properties of gradient descent methods for linear networks that, as suggested in (1), can be the key to understanding generalization. Gradient descent enforces a form of implicit regularization controlled by the number of iterations, and asymptotically converges to the minimum norm solution. This implies that there is usually an optimum early stopping that avoids overfitting of the loss (this is relevant mainly for regression). For classification, the asymptotic convergence to the minimum norm solution implies convergence to the maximum margin solution, which guarantees good classification error for "low noise" datasets. The implied robustness to overparametrization has suggestive implications for the robustness of deep hierarchically local networks to variations of the architecture with respect to the curse of dimensionality.
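The linear-network property that the abstract builds on can be illustrated with a small numerical sketch. It is not taken from the paper: it assumes NumPy, a synthetic random overparametrized least-squares problem, and hypothetical names (`gd_least_squares`, `w_min_norm`). Plain gradient descent started from zero is compared with the pseudoinverse (minimum norm) solution, and the norm of the iterate is tracked as a rough proxy for how the number of iterations controls the implicit regularization.

```python
import numpy as np


def gd_least_squares(X, y, lr, steps):
    """Plain gradient descent on 0.5 * ||X w - y||^2, started from w = 0.

    Returns the final weights and the weight norm after every step.
    """
    w = np.zeros(X.shape[1])
    norms = []
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y)  # gradient of the quadratic loss
        norms.append(np.linalg.norm(w))
    return w, np.array(norms)


# Synthetic overparametrized problem (assumption): 20 samples, 100 weights.
rng = np.random.default_rng(0)
n, d = 20, 100
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Step size 1 / sigma_max^2 keeps every mode of the iteration stable.
lr = 1.0 / np.linalg.norm(X, 2) ** 2

w_gd, norms = gd_least_squares(X, y, lr, steps=5000)

# Minimum norm interpolating solution via the Moore-Penrose pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("training residual:", np.linalg.norm(X @ w_gd - y))
print("distance to minimum norm solution:", np.linalg.norm(w_gd - w_min_norm))
print("weight norm after 10 / 100 / 5000 steps:", norms[9], norms[99], norms[-1])
```

In this linear setting the iterate never leaves the row space of `X`, so the interpolating solution it reaches is the minimum norm one. The weight norm grows with the iteration count, so stopping earlier yields a smaller-norm, more strongly regularized solution. The sketch says nothing about the nonlinear case, which is the subject of the note itself.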

 
