Gradient Descent Finds Global Minima of Deep Neural Networks

Abstract

Gradient descent finds a global minimum in training deep neural networksdespite the objective function being non-convex. The current paper provesgradient descent achieves zero training loss in polynomial time for a deepover-parameterized neural network with residual connections (ResNet). Ouranalysis relies on the particular structure of the Gram matrix induced by theneural network architecture. This structure allows us to show the Gram matrixis stable throughout the training process and this stability implies the globaloptimality of the gradient descent algorithm. Our bounds also shed light on theadvantage of using ResNet over the fully connected feedforward architecture;our bound requires the number of neurons per layer scaling exponentially withdepth for feedforward networks whereas for ResNet the bound only requires thenumber of neurons per layer scaling polynomially with depth. We further extendour analysis to deep residual convolutional neural networks and obtain asimilar convergence result.

Quick Read (beta)

loading the full paper ...