Sparse Networks from Scratch: Faster Training without Losing Performance

Abstract

We demonstrate the possibility of what we call sparse learning: accelerated training of deep neural networks that maintain sparse weights throughout training while achieving performance levels competitive with dense networks. We accomplish this by developing sparse momentum, an algorithm which uses exponentially smoothed gradients (momentum) to identify layers and weights which reduce the error efficiently. Sparse momentum redistributes pruned weights across layers according to the mean momentum magnitude of each layer. Within a layer, sparse momentum grows weights according to the momentum magnitude of zero-valued weights. We demonstrate state-of-the-art sparse performance on MNIST, CIFAR-10, and ImageNet, decreasing the mean error by a relative 8%, 15%, and 6%, respectively, compared to other sparse algorithms. Furthermore, we show that our algorithm can reliably find the equivalent of winning lottery tickets from random initialization: our algorithm finds sparse configurations with 20% or fewer weights which perform as well as, or better than, their dense counterparts. Sparse momentum also decreases the training time: it requires a single training run, with no re-training, and increases training speed by up to 11.85x. In our analysis, we show that our sparse networks might be able to reach dense performance levels by learning more general features which are useful to a broader range of classes than those learned by dense networks.
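
The redistribution and growth steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (layer_shares, allocate, grow), the flat-list representation of layers, the choice to average momentum magnitude over currently active weights only, and the largest-remainder rounding are all assumptions made for illustration. The number of pruned weights is taken as a given input, since the abstract does not state the pruning criterion.

```python
# Illustrative sketch only; names and details are assumptions, not the paper's code.
# Each layer is a flat list of momentum values plus a boolean mask (True = active weight).

def layer_shares(momenta, masks):
    """Fraction of regrowth budget per layer, proportional to mean |momentum|.

    Assumption: the mean is taken over currently active (unmasked) weights.
    """
    means = []
    for momentum, mask in zip(momenta, masks):
        active = [abs(v) for v, keep in zip(momentum, mask) if keep]
        means.append(sum(active) / len(active) if active else 0.0)
    total = sum(means)
    if total == 0.0:
        # No momentum signal anywhere: fall back to an even split.
        return [1.0 / len(means)] * len(means)
    return [m / total for m in means]


def allocate(num_pruned, shares):
    """Turn fractional shares into integer counts that sum to num_pruned."""
    raw = [num_pruned * s for s in shares]
    counts = [int(r) for r in raw]
    leftover = num_pruned - sum(counts)
    # Hand out the remaining slots to the largest fractional remainders.
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


def grow(mask, momentum, k):
    """Enable up to k zero-valued weights with the largest |momentum|."""
    zero_positions = [i for i, keep in enumerate(mask) if not keep]
    zero_positions.sort(key=lambda i: abs(momentum[i]), reverse=True)
    new_mask = list(mask)
    for i in zero_positions[:k]:
        new_mask[i] = True
    return new_mask


if __name__ == "__main__":
    momenta = [[0.5, -1.0, 2.0, 0.125, -0.25, 0.0], [0.25, 0.0, -3.0]]
    masks = [[True, True, False, False, False, False], [True, False, False]]
    counts = allocate(4, layer_shares(momenta, masks))  # 4 weights were pruned
    masks = [grow(m, mom, k) for m, mom, k in zip(masks, momenta, counts)]
    print(counts, masks)
```

In this toy setup the first layer has the larger mean momentum magnitude among its active weights, so it receives most of the regrowth budget. Within each layer, the masked positions with the largest momentum magnitude are re-enabled first.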
