Theoretical Analysis of Auto Rate-Tuning by Batch Normalization

Abstract

Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability to allow gradient descent to succeed with less tuning of learning rates. It is shown that even if we fix the learning rate of scale-invariant parameters (e.g., weights of each layer with BN) to a constant (say, $0.3$), gradient descent still approaches a stationary point (i.e., a solution where the gradient is zero) at a rate of $T^{-1/2}$ in $T$ iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates. A similar result with convergence rate $T^{-1/4}$ is also shown for stochastic gradient descent.
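
The claim about scale-invariant parameters can be illustrated with a small numerical sketch. The code below is not taken from the paper: the toy loss (negative cosine similarity to a target direction), the names `loss`, `grad`, `run_gd`, and the vectors `t` and `w0` are all assumptions chosen for illustration, and only the fixed learning rate $0.3$ comes from the abstract. The toy loss satisfies $f(cw) = f(w)$ for every $c > 0$, which implies $\langle w, \nabla f(w) \rangle = 0$ and $\nabla f(cw) = \nabla f(w)/c$. Under plain gradient descent the weight norm therefore never shrinks, and the step size measured relative to the weight norm decreases on its own even though the nominal learning rate stays fixed.

```python
import math

def loss(w, t):
    # Toy scale-invariant loss (an assumption for illustration): negative
    # cosine similarity, which depends only on the direction of w.
    norm = math.sqrt(sum(x * x for x in w))
    return -sum(x * y for x, y in zip(w, t)) / norm

def grad(w, t):
    # Analytic gradient of the toy loss with respect to w.
    norm = math.sqrt(sum(x * x for x in w))
    dot = sum(x * y for x, y in zip(w, t))
    return [-(y / norm - dot * x / norm ** 3) for x, y in zip(w, t)]

def run_gd(w, t, lr=0.3, steps=200):
    # Plain gradient descent with a fixed learning rate.
    # Records (weight norm, gradient norm) before each update.
    history = []
    for _ in range(steps):
        g = grad(w, t)
        w_norm = math.sqrt(sum(x * x for x in w))
        g_norm = math.sqrt(sum(x * x for x in g))
        history.append((w_norm, g_norm))
        w = [x - lr * gx for x, gx in zip(w, g)]
    return w, history

t = [0.6, 0.8]    # hypothetical unit-norm target direction
w0 = [1.0, -0.5]  # hypothetical initial weights

w_final, history = run_gd(w0, t)
print("initial (weight norm, gradient norm):", history[0])
print("final   (weight norm, gradient norm):", history[-1])
print("final loss:", loss(w_final, t))
```

In this sketch the gradient norm shrinks toward zero while the weight norm grows slightly, which is the sense in which a scale-invariant parameter tunes its own effective step size. It shows the mechanism on one toy problem only and does not reproduce the $T^{-1/2}$ rate or the stochastic analysis.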
