Which Neural Net Architectures Give Rise To Exploding and Vanishing Gradients?

Abstract

We give a rigorous analysis of the statistical behavior of gradients in randomly initialized feed-forward networks with ReLU activations. Our results show that a fully connected depth $d$ ReLU net with hidden layer widths $n_j$ will have exploding and vanishing gradients if and only if $\sum_{j=1}^{d-1} 1/n_j$ is large. The point of view of this article is that whether a given neural net will have exploding/vanishing gradients is a function mainly of the architecture of the net, and hence can be tested at initialization. Our results imply that a fully connected network that produces manageable gradients at initialization must have many hidden layers that are about as wide as the network is deep. This work is related to the mean field theory approach to random neural nets. From this point of view, we give a rigorous computation of the $1/n_j$ corrections to the propagation of gradients at the so-called edge of chaos.
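
The architecture statistic named in the abstract, $\sum_{j=1}^{d-1} 1/n_j$, can be computed directly from a list of hidden layer widths. The following is a minimal Python sketch for comparing architectures of equal depth. The function name `inverse_width_sum` and the example widths are illustrative assumptions and do not come from the paper. The abstract gives no numeric threshold for "large", so the sketch only reports the values.

```python
def inverse_width_sum(hidden_widths):
    """Return sum_j 1/n_j over the hidden layer widths n_j.

    hidden_widths: positive integers, one per hidden layer
    (d - 1 entries for a depth-d network, in the abstract's notation).
    """
    if any(n <= 0 for n in hidden_widths):
        raise ValueError("all hidden layer widths must be positive")
    return sum(1.0 / n for n in hidden_widths)


# Three hypothetical depth-50 networks (49 hidden layers each).
narrow = [10] * 49     # widths much smaller than the depth
matched = [50] * 49    # widths comparable to the depth
wide = [500] * 49      # widths much larger than the depth

for name, widths in [("narrow", narrow), ("matched", matched), ("wide", wide)]:
    print(f"{name:8s} sum of 1/n_j = {inverse_width_sum(widths):.3f}")
```

At fixed depth the sum shrinks as the hidden layers widen, which matches the abstract's statement that manageable gradients at initialization require many hidden layers about as wide as the network is deep.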
