Gradient descent aligns the layers of deep linear networks

Abstract

This paper establishes risk convergence and asymptotic weight matrix alignment (a form of implicit regularization) of gradient flow and gradient descent when applied to deep linear networks on linearly separable data. In more detail, for gradient flow applied to strictly decreasing loss functions (with similar results for gradient descent with particular decreasing step sizes): (i) the risk converges to 0; (ii) the normalized i-th weight matrix asymptotically equals its rank-1 approximation $u_iv_i^{\top}$; (iii) these rank-1 matrices are aligned across layers, meaning $|v_{i+1}^{\top}u_i|\to1$. In the case of the logistic loss (binary cross entropy), more can be said: the linear function induced by the network (the product of its weight matrices) converges to the same direction as the maximum margin solution. This last property was identified in prior work, but only under assumptions on gradient descent which here are implied by the alignment phenomenon.
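
The quantities in (ii) and (iii) can be measured numerically. The following is a minimal NumPy sketch, not the paper's experimental setup: the function names, the toy data, the layer widths, and the constant step size 0.1 are all assumptions made for illustration (the abstract's gradient descent results concern particular decreasing step sizes), and the last layer is assumed to have a single output row. For each layer it reports the Frobenius distance between the normalized weight matrix and $u_iv_i^{\top}$, and for adjacent layers the value $|v_{i+1}^{\top}u_i|$.

```python
import numpy as np


def product(weights):
    """Return W_L ... W_1 for weights = [W_1, ..., W_L]."""
    P = weights[0]
    for W in weights[1:]:
        P = W @ P
    return P


def logistic_risk(weights, X, y):
    """Mean logistic loss of the linear predictor x -> (W_L ... W_1) x."""
    margins = y * (X @ product(weights).T).ravel()
    return np.mean(np.logaddexp(0.0, -margins))


def gradients(weights, X, y):
    """Gradient of the logistic risk with respect to every layer."""
    n, d = X.shape
    L = len(weights)
    # prefix[k] = W_k ... W_1, with prefix[0] the identity.
    prefix = [np.eye(d)]
    for W in weights:
        prefix.append(W @ prefix[-1])
    # suffix[k] = W_L ... W_{k+1}, with suffix[L] the identity.
    suffix = [np.eye(weights[-1].shape[0])]
    for W in reversed(weights):
        suffix.append(suffix[-1] @ W)
    suffix.reverse()
    margins = y * (X @ prefix[L].T).ravel()
    # Derivative of log(1 + exp(-m)) in m is -1 / (1 + exp(m)),
    # written with tanh to avoid overflow for large margins.
    coef = -y * 0.5 * (1.0 - np.tanh(margins / 2.0))
    g = (coef @ X)[None, :] / n  # gradient w.r.t. the product, shape (1, d)
    return [suffix[k].T @ g @ prefix[k - 1].T for k in range(1, L + 1)]


def alignment_report(weights):
    """Return (rank-1 gaps, adjacent-layer alignments |v_{i+1}^T u_i|)."""
    pairs = []
    for W in weights:
        U, _, Vt = np.linalg.svd(W, full_matrices=False)
        pairs.append((U[:, 0], Vt[0]))  # top left / right singular vectors
    gaps = [
        np.linalg.norm(W / np.linalg.norm(W) - np.outer(u, v))
        for W, (u, v) in zip(weights, pairs)
    ]
    aligns = [
        abs(pairs[i + 1][1] @ pairs[i][0]) for i in range(len(weights) - 1)
    ]
    return gaps, aligns


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy linearly separable data: the label is the sign of the first coordinate.
    X = rng.normal(size=(50, 3))
    X[:, 0] += np.sign(X[:, 0])  # push points away from the decision boundary
    y = np.sign(X[:, 0])
    weights = [0.5 * rng.normal(size=s) for s in [(4, 3), (4, 4), (1, 4)]]
    for _ in range(5000):
        grads = gradients(weights, X, y)
        weights = [W - 0.1 * G for W, G in zip(weights, grads)]
    gaps, aligns = alignment_report(weights)
    print("risk:", logistic_risk(weights, X, y))
    print("rank-1 gaps:", gaps)
    print("adjacent alignments:", aligns)
```

If the behavior described in the abstract carries over to this toy setting, the rank-1 gaps should shrink toward 0 and the adjacent alignments should approach 1 as training proceeds. Convergence is asymptotic, so a finite run only shows the trend.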
