Mean Shift Rejection: Training Deep Neural Networks Without Minibatch Statistics or Normalization

Abstract

Deep convolutional neural networks are known to be unstable during training at high learning rate unless normalization techniques are employed. Normalizing weights or activations allows the use of higher learning rates, resulting in faster convergence and higher test accuracy. Batch normalization requires minibatch statistics that approximate the dataset statistics but this incurs additional compute and memory costs and causes a communication bottleneck for distributed training. Weight normalization and initialization-only schemes do not achieve comparable test accuracy. We introduce a new understanding of the cause of training instability and provide a technique that is independent of normalization and minibatch statistics. Our approach treats training instability as a spatial common mode signal which is suppressed by placing the model on a channel-wise zero-mean isocline that is maintained throughout training. Firstly, we apply channel-wise zero-mean initialization of filter kernels with overall unity kernel magnitude. At each training step we modify the gradients of spatial kernels so that their weighted channel-wise mean is subtracted in order to maintain the common mode rejection condition. This prevents the onset of mean shift. This new technique allows direct training of the test graph so that training and test models are identical. We also demonstrate that injecting random noise throughout the network during training improves generalization. This is based on the idea that, as a side effect, batch normalization performs deep data augmentation by injecting minibatch noise due to the weakness of the dataset approximation. Our technique achieves higher accuracy compared to batch normalization and for the first time shows that minibatches and normalization are unnecessary for state-of-the-art training.
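The two steps named in the abstract (zero-mean initialization with unit kernel magnitude, then mean subtraction on the gradients) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: "channel-wise" is read as the mean over the spatial positions of each (output, input) kernel slice, "unity kernel magnitude" is read as unit L2 norm per output filter, and the "weighted" mean is simplified to a plain unweighted mean. The function names `init_kernel`, `reject_common_mode`, and `sgd_step` are hypothetical, and only the Python standard library is used.

```python
import math
import random


def init_kernel(out_ch, in_ch, k, seed=0):
    """Return kernel[o][i] as a flat list of k*k spatial weights.

    Assumption: each (o, i) slice gets zero spatial mean, and each output
    filter is then scaled to unit L2 norm.
    """
    if k < 2:
        raise ValueError("a spatial kernel needs k >= 2")
    rng = random.Random(seed)
    kernel = []
    for _ in range(out_ch):
        filt = []
        for _ in range(in_ch):
            w = [rng.gauss(0.0, 1.0) for _ in range(k * k)]
            mean = sum(w) / len(w)
            filt.append([x - mean for x in w])  # zero spatial mean
        norm = math.sqrt(sum(x * x for ch in filt for x in ch))
        # Scaling by a constant keeps every slice mean at zero.
        kernel.append([[x / norm for x in ch] for ch in filt])
    return kernel


def reject_common_mode(grad):
    """Subtract the (unweighted, by assumption) spatial mean of each slice."""
    projected = []
    for filt in grad:
        new_filt = []
        for ch in filt:
            mean = sum(ch) / len(ch)
            new_filt.append([g - mean for g in ch])
        projected.append(new_filt)
    return projected


def sgd_step(kernel, grad, lr):
    """Plain SGD update using the mean-rejected gradient."""
    grad = reject_common_mode(grad)
    return [
        [[w - lr * g for w, g in zip(w_ch, g_ch)]
         for w_ch, g_ch in zip(w_filt, g_filt)]
        for w_filt, g_filt in zip(kernel, grad)
    ]
```

Because the projected gradient of every slice sums to zero, an update of this form leaves each slice mean unchanged, which is one way to read "maintained throughout training". The sketch does not re-normalize the kernel magnitude after each step, since the abstract states the unit magnitude only for initialization.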
