Rigging the Lottery: Making All Tickets Winners

Abstract

Sparse neural networks have been shown to be more parameter and compute efficient compared to dense networks and in some cases are used to decrease wall clock inference times. There is a large body of work on training dense networks to yield sparse networks for inference. This limits the size of the largest trainable sparse model to that of the largest trainable dense model. In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. Our method updates the topology of the network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. Importantly, by adjusting the topology it can start from any initialization - not just "lucky" ones. We demonstrate state-of-the-art sparse training results with ResNet-50, MobileNet v1 and MobileNet v2 on the ImageNet-2012 dataset, WideResNets on the CIFAR-10 dataset and RNNs on the WikiText-103 dataset. Finally, we provide some insights into why allowing the topology to change during the optimization can overcome local minima encountered when the topology remains static.
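
To make the topology update concrete, the following is a minimal sketch of one update step on a single flattened layer. The abstract states only that the update uses parameter magnitudes and infrequent gradient calculations, so the specifics here are assumptions for illustration: the function name `update_topology`, the `drop_fraction` parameter, dropping the smallest-magnitude active weights, growing the same number of previously inactive connections with the largest gradient magnitude, and initializing new connections to zero.

```python
def update_topology(weights, mask, grads, drop_fraction):
    """One hypothetical drop/grow step; returns (new_weights, new_mask).

    weights, grads: lists of floats for one flattened layer.
    mask: list of 0/1 ints of the same length (1 = active connection).
    """
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    # Number of connections to swap; capped by the available inactive slots.
    k = min(int(drop_fraction * len(active)), len(inactive))

    # Drop: the k active connections with the smallest weight magnitude.
    dropped = sorted(active, key=lambda i: abs(weights[i]))[:k]
    # Grow: the k previously inactive connections with the largest
    # gradient magnitude (assumption: just-dropped ones are not candidates).
    grown = sorted(inactive, key=lambda i: abs(grads[i]), reverse=True)[:k]

    new_weights, new_mask = list(weights), list(mask)
    for i in dropped:
        new_mask[i] = 0
        new_weights[i] = 0.0
    for i in grown:
        new_mask[i] = 1
        new_weights[i] = 0.0  # assumption: new connections start at zero
    return new_weights, new_mask


weights = [0.5, -0.1, 0.0, 0.0, 2.0, 0.0]
mask = [1, 1, 0, 0, 1, 0]
grads = [0.1, 0.2, 0.9, -0.3, 0.0, 0.05]
new_weights, new_mask = update_topology(weights, mask, grads, drop_fraction=0.5)
```

Because the sketch drops and grows the same number of connections, the count of active parameters is unchanged by the step, which mirrors the fixed parameter count described above. In a training loop such a step would run only every so many iterations, since it is the only point at which gradients for inactive connections are needed.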
