Batch Normalization Explained

Abstract

A critically important, ubiquitous, and yet poorly understood ingredient in modern deep networks (DNs) is batch normalization (BN), which centers and normalizes the feature maps. To date, only limited progress has been made understanding why BN boosts DN learning and inference performance; work has focused exclusively on showing that BN smooths a DN's loss landscape. In this paper, we study BN theoretically from the perspective of function approximation; we exploit the fact that most of today's state-of-the-art DNs are continuous piecewise affine (CPA) splines that fit a predictor to the training data via affine mappings defined over a partition of the input space (the so-called "linear regions"). We demonstrate that BN is an unsupervised learning technique that, independent of the DN's weights or gradient-based learning, adapts the geometry of a DN's spline partition to match the data. BN provides a "smart initialization" that boosts the performance of DN learning, because it adapts even a DN initialized with random weights to align its spline partition with the data. We also show that the variation of BN statistics between mini-batches introduces a dropout-like random perturbation to the partition boundaries and hence the decision boundary for classification problems. This per mini-batch perturbation reduces overfitting and improves generalization by increasing the margin between the training samples and the decision boundary.
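
The two mechanisms described above (centering and normalizing the feature maps, and the variation of BN statistics between mini-batches) can be illustrated with a minimal NumPy sketch. It is not the paper's code. It assumes a single dense layer followed by ReLU, no learnable BN scale or shift, and toy Gaussian data; the names `batch_norm`, `data`, `weights`, and `boundary_shift` are hypothetical and chosen only for illustration.

```python
import numpy as np

def batch_norm(pre_activations, eps=1e-5):
    """Center and normalize each feature (column) with mini-batch statistics."""
    mean = pre_activations.mean(axis=0)
    var = pre_activations.var(axis=0)
    normalized = (pre_activations - mean) / np.sqrt(var + eps)
    return normalized, mean, var

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=(512, 2))  # toy inputs, far from the origin
weights = rng.normal(size=(2, 4))                      # random, untrained layer weights

# With statistics from all the data, a ReLU applied after BN switches at
# normalized == 0, i.e. on the hyperplane w^T x = mean, which passes through
# the mean of the data regardless of the (random) weights.
normalized, mean, var = batch_norm(data @ weights)

# Two mini-batches yield slightly different statistics, so the same unit's
# boundary sits at a slightly different offset for each mini-batch.
batch_a, batch_b = data[:64], data[64:128]
_, mean_a, _ = batch_norm(batch_a @ weights)
_, mean_b, _ = batch_norm(batch_b @ weights)
boundary_shift = np.abs(mean_a - mean_b)  # per-unit offset difference
```

Under these assumptions, `normalized` has approximately zero mean and unit variance per unit, while the nonzero entries of `boundary_shift` show how mini-batch sampling alone moves the partition boundaries from one batch to the next.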
