Sparse neural networks have been shown to be more parameter and computeefficient compared to dense networks and in some cases are used to decreasewall clock inference times. There is a large body of work on training densenetworks to yield sparse networks for inference. This limits the size of thelargest trainable sparse model to that of the largest trainable dense model. Inthis paper we introduce a method to train sparse neural networks with a fixedparameter count and a fixed computational cost throughout training, withoutsacrificing accuracy relative to existing dense-to-sparse training methods. Ourmethod updates the topology of the network during training by using parametermagnitudes and infrequent gradient calculations. We show that this approachrequires fewer floating-point operations (FLOPs) to achieve a given level ofaccuracy compared to prior techniques. Importantly, by adjusting the topologyit can start from any initialization - not just "lucky" ones. We demonstratestate-of-the-art sparse training results with ResNet-50, MobileNet v1 andMobileNet v2 on the ImageNet-2012 dataset, WideResNets on the CIFAR-10 datasetand RNNs on the WikiText-103 dataset. Finally, we provide some insights intowhy allowing the topology to change during the optimization can overcome localminima encountered when the topology remains static.
Quick Read (beta)
Rigging the Lottery:
Making All Tickets Winners
Sparse neural networks have been shown to be more parameter and compute efficient compared to dense networks and in some cases are used to decrease wall clock inference times. There is a large body of work on training dense networks to yield sparse networks for inference (Molchanov et al., 2017; Zhu and Gupta, 2018; Narang et al., 2017; Li et al., 2016; Guo et al., 2016). This limits the size of the largest trainable sparse model to that of the largest trainable dense model. In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. Our method updates the topology of the network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. Importantly,by adjusting the topology it can start from any initialization – not just “lucky” ones. We demonstrate state-of-the-art sparse training results with ResNet-50, MobileNet v1 and MobileNet v2 on the ImageNet-2012 dataset, WideResNets on the CIFAR-10 dataset and RNNs on the WikiText-103 dataset. Finally, we provide some insights into why allowing the topology to change during the optimization can overcome local minima encountered when the topology remains static.
The parameter and floating point operation (FLOP) efficiency of sparse neural networks is now well demonstrated on a variety of problems (Han et al., 2015; Srinivas et al., 2017). Multiple works have shown inference time speedups are possible using sparsity for both Recurrent Neural Networks (RNNs) (Kalchbrenner et al., 2018) and Convolutional Neural Networks (ConvNets) (Park et al., 2016; Elsen et al., 2019). Currently, the most accurate sparse models are obtained with techniques that require, at a minimum, the cost of training a dense model in terms of memory and FLOPs (Zhu and Gupta, 2018; Guo et al., 2016), and sometimes significantly more (Molchanov et al., 2017). This paradigm has two main limitations:
The maximum size of sparse models is limited to the largest dense model that can be trained. Even if sparse models are more parameter efficient, we can’t use pruning to train models that are larger and more accurate than the largest possible dense models.
It is inefficient. Large amounts of computation must be performed for parameters that are zero valued or that will be zero during inference.
Additionally, it remains unknown if the performance of the current best pruning algorithms are an upper bound on the quality of sparse models. Gale et al. (2019) found that three different dense-to-sparse training algorithms all achieve about the same sparsity / accuracy trade-off. However, this is far from conclusive proof that no better performance is possible. In this work we show the surprising result that dynamic sparse training, which includes the method we introduce below, can find more accurate models than the current best approaches to pruning initially dense networks. Importantly, our method does not change the FLOPs required to execute the model during training, allowing one to decide on a specific inference cost prior to training.
The Lottery Ticket Hypothesis (Frankle and Carbin, 2019) hypothesized that if we can find a sparse neural network with iterative pruning, then we can train that sparse network from scratch, to the same level of accuracy, by starting from the original initial conditions. In this paper we introduce a new method for training sparse models without the need of a “lucky” initialization; for this reason, we call our method “The Rigged Lottery” or RigL** * Pronounced ”wriggle”.. We show that this method is:
Memory efficient: It requires memory only proportional to the size of the sparse model. It never requires storing quantities that are the size of the dense model. This is in contrast to Dettmers and Zettlemoyer (2019) which requires storing the momentum for all parameters, even those that are zero valued.
Computationally efficient: The amount of computation required to train the model is proportional to the number of nonzero parameters in the model.
Accurate: The performance achieved by the method matches and sometimes exceeds the performance of pruning based approaches.
Our method works by infrequently using instantaneous gradient information to inform a re-wiring of the network. We show that this allows the optimization to escape local minima where it would otherwise become trapped if the sparsity pattern were to remain static. Crucially, as long as the full gradient information is needed less than every iterations, then the overall work remains proportional to the model sparsity.
Finally, in appendix B we compare RigL with state-of-the-art structured Bayesian pruning methods. We show that sparse training requires far fewer resources when initially training or “compressing” a network and that the final architecture found by RigL is smaller and requires fewer FLOPs for inference. This remains true even when only considering the cost of training the smaller dense network found by the structured pruning methods.
2 Related Work
Research on finding sparse neural networks dates back decades, at least to Thimm and Fiesler (1995) who concluded that pruning weights based on magnitude was a simple and powerful technique. Ström (1997) later introduced the idea of retraining the previously pruned network to increase accuracy. Han et al. (2016b) went further and introduced multiple rounds of magnitude pruning and retraining. This is, however, relatively inefficient, requiring ten rounds of retraining when removing of the connections to reach a final sparsity of . To overcome this problem, Narang et al. (2017) introduced gradual pruning, where connections are slowly removed over the course of a single round of training. Zhu and Gupta (2018) refined the technique to minimize the amount of hyper-parameter selection required.
|Method||Drop||Grow||Selectable FLOPs||Space & FLOPs|
A diversity of approaches not based on magnitude based pruning have also been proposed. LeCun et al. (1990) and Hassibi and Stork (1993) are some early examples, but impractical for modern neural networks as they use information from the Hessian to prune a trained network. More recent work includes Regularization (Christos Louizos, 2018), Variational Dropout (Molchanov et al., 2017), Dynamic Network Surgery (Guo et al., 2016) and Sensitivity Driven Regularization (Tartaglione et al., 2018). Gale et al. (2019) examined magnitude pruning, Regularization and Variational Dropout and concluded that they all achieve about the same accuracy versus sparsity trade-off on ResNet-50 and Transformer architectures.
Training techniques that allow for sparsity throughout the entire training process were, to our knowledge, first introduced in Deep Rewiring (DeepR) (Bellec et al., 2018). In DeepR, the standard Stochastic Gradient Descent (SGD) optimizer is augmented with a random walk in parameter space. Additionally, connections have a pre-defined sign assigned at random; when the optimizer would normally flip the sign, the weight is set to 0 instead and new weights are activated at random.
Sparse Evolutionary Training (SET) (Mocanu et al., 2018) proposed a simpler scheme where weights are pruned according to the standard magnitude criterion used in pruning and are added back at random. The method is simple and achieves reasonable performance in practice. Dynamic Sparse Reparameterization (DSR) (Mostafa and Wang, 2019) introduced the idea of allowing the parameter budget to shift between different layers of the model, allowing for non-uniform sparsity. This allows the model to distribute parameters where they are most effective. Unfortunately, the models under consideration are mostly convolutional networks, so the result of this parameter reallocation (which is to decrease the sparsity of early layers and increase the sparsity of later layers) has the overall effect of increasing the FLOP count because the spatial size is largest at the beginning. Sparse Networks from Scratch (SNFS) (Dettmers and Zettlemoyer, 2019) introduces the idea of using the momentum of each parameter as the criterion to be used for growing weights and demonstrates it leads to an improvement in test accuracy. Like DSR, they allow the sparsity of each layer to change and focus on a constant parameter, not FLOP, budget. Importantly, the method requires computing gradients and updating the momentum for every parameter in the model, even those that are zero, at every iteration. This can result in a significant amount of overall computation. Additionally, depending on the model and training setup, the required storage for the full momentum tensor could be prohibitive. Single-Shot Network Pruning (SNIP) (Lee et al., 2019) attempts to find an initial mask with one-shot pruning and uses the saliency score of parameters to decide which parameters to keep. After pruning training proceeds with this static sparse network. Properties of the different sparse training techniques are summarized in Table 1.
There has also been a line of work investigating the Lottery Ticket Hypothesis (Frankle and Carbin, 2019). Frankle et al. (2019) showed that the formulation must be weakened to apply to larger networks such as ResNet-50 (He et al., 2015). In large networks, instead of the original initialization, the values after thousands of optimization steps must be used for initialization. Zhou et al. (2019) showed that lottery tickets obtain non-random accuracies even before the training has started. Though the possibility of training sparse neural networks with a fixed sparsity mask using lottery tickets is intriguing, it remains unclear whether it is possible to generate such initializations – for both masks and parameters – de novo.
3 Rigging The Lottery
Our method, RigL, is illustrated in Figure 1. At regularly spaced intervals our method removes a fraction of connections based on weight magnitudes and activates new ones using instantaneous gradient information. After updating the connectivity, training continues with the updated network until the next update. The main parts of our algorithm, Sparsity Distribution, Update Schedule, Drop Criterion, Grow Criterion, and the various options we considered for each, are explained below. The improved performance of RigL is due to two reasons: the use of a new method for activating connections that is efficient and more effective than choosing at random, and the use of a natural extension to an existing method for distributing parameters statically among convolutional layers.
(0) Notation. Given a dataset with individual inputs and targets , one can train a neural network to minimize the loss function , where is the neural network with parameters of length . The vector can be decomposed into parameters , of length , for each layer . A sparse network keeps only a fraction of all connections, resulting in a sparsity of . More precisely, denoting the sparsity of individual layers with , the total parameter count of the sparse neural network satisfies .
(1) Sparsity Distribution. There are many ways of distributing the non-zero weights across the layers while satisfying the equality above. We avoid re-allocating parameters between layers during the training process as it makes it difficult to target a specific final FLOP budget, which is important for many inference applications. We consider the following three strategies:
Uniform: The sparsity of each individual layer is the same as the total sparsity .
Erdős-Rényi: As introduced in Mocanu et al. (2018), scales with , where denotes number of neurons at layer . This enables the number of connections in a sparse layer to scale with the sum of the number of output and input channels.
Erdős-Rényi-Kernel (ERK): This method modifies the original Erdős-Rényi formulation by including the kernel dimensions in the scaling factors. In other words, the number of parameters of the sparse convolutional layers are scaled proportional to , where and are the width and the height of the ’th convolutional kernel. Sparsity of the fully connected layers scale as in the original Erdős-Rényi formulation. Similar to Erdős-Rényi, ERK allocates higher sparsities to the layers with more parameters while allocating lower sparsities to the smaller ones.
In all methods, the bias and batch-norm parameters are kept dense.
(2) Update Schedule. The update schedule is defined by the following parameters: (1) the number of iterations between sparse connectivity updates (), (2) the iteration at which to stop updating the sparse connectivity (), (3) the initial fraction of connections updated () and (4) a function , invoked every iterations until , possibly decaying the fraction of updated connections over time. For the latter we choose to use cosine annealing, as we find it slightly outperforms the other methods considered.
Results for the other annealing schedules we considered, such as constant and inverse power, are presented in Appendix G.
(3) Drop criterion. Every steps we drop the connections given by
, where gives the indices and values of the top- elements of vector .
(4) Grow criterion. The novelty of our method lies in how we grow new connections. We grow the connections with highest magnitude gradients, , where is the set of active connections after the drop step. Newly activated connections are initialized to zero and therefore don’t affect the output of the network. However they are expected to receive gradients with high magnitudes in the next iteration and therefore reduce the loss fastest. This procedure can be applied to each layer in sequence and the dense gradients can be discarded immediately after selecting the top connections. If a layer is too large to materialize the full gradient with respect to the weights, then we can further reduce the memory requirements by performing an iterative calculation:
Initialize the set .
Materialize a subset of size M of the full gradient, which we denote .
Update to contain the Top-K elements of concatenated with .
Repeat steps 1 through 3 until all of the gradients have been materialized. The final set contains the connections we wish to grow.
As long as the total work in calculating dense gradients is amortized and still proportional to . This is in contrast to the method of Dettmers and Zettlemoyer (2019), which requires calculating and storing the full gradients at each optimization step.
4 Empirical Evaluation
Our experiments include image classification using CNNs on the ImageNet-2012 (Russakovsky et al., 2015) and CIFAR-10 (Krizhevsky et al., ) datasets and character based language modelling using RNNs with the WikiText-103 dataset (Merity et al., 2016). We use the TensorFlow Model Pruning library (Zhu and Gupta, 2018) for our pruning baselines. A Tensorflow (Abadi et al., 2015) implementation of our method along with three other baselines (SET, SNFS, SNIP) can be found here. When increasing the training steps by a factor , the anchor epochs of the learning rate schedule and the end iteration of the mask update schedule are also scaled by the same factor; we indicate this scaling with a subscript (e.g. RigL).
4.1 ImageNet-2012 Dataset
In all experiments in this section, we use SGD with momentum as our optimizer. We set the momentum coefficient of the optimizer to 0.9, regularization coefficient to 0.0001, and label smoothing (Szegedy et al., 2016) to 0.1. The learning rate schedule starts with a linear warm up reaching its maximum value of 1.6 at epoch 5 which is then dropped by a factor of 10 at epochs 30, 70 and 90. We train our networks with a batch size of 4096 for 32000 steps which roughly corresponds to 100 epochs of training. Our training pipeline uses standard data augmentation, which includes random flips and crops. For all the networks in this section, when using uniform layer-wise sparsity, we keep the very first layer dense.
Figure 2-left summarizes the performance of various methods on training an 80% sparse ResNet-50. We also train small dense networks with equivalent parameter count. All sparse networks use a uniform layer-wise sparsity distribution unless otherwise specified and a cosine update schedule (, ). Overall, we observe that the performance of all methods improves with training time; thus, for each method we run extended training with up to 5 the training steps of the original.
|FLOPs (Train)||FLOPs (Test)||Top-1 Accuracy||FLOPs (Train)||FLOPs (Test)|
|Dense||76.80.09||1x (3.2e18)||1x (8.2e9)|
As noted by Gale et al. (2019), Evci et al. (2019), Frankle et al. (2019), and Mostafa and Wang (2019), training a network with fixed sparsity from scratch (Static) leads to inferior performance. Training a small dense network with the same number of parameters gets better results than Static, but fails to match the performance of dynamic sparse models. Similarly SET improves the performance over Small-Dense, however saturates around 75% accuracy indicating the limits of growing new connections randomly. Methods that use gradient information to grow new connections (RigL and SNFS) obtain higher accuracies, but RigL achieves the highest accuracy and does so while consistently requiring fewer FLOPs than the other methods.
Given that different applications or scenarios might require a limit on the number of FLOPs for inference, we investigate the performance of our method at various sparsity levels. As mentioned previously, one strength of our method is that its resource requirements are constant throughout training and we can choose the level of sparsity that fits our training and/or inference constraints. In Figure 2-right we show the performance of our method at different sparsities and compare them with the pruning results of Gale et al. (2019), which uses 1.5x training steps, relative to the original 32k iterations. To make a fair comparison with regards to FLOPs, we scale the learning schedule of all other methods by 5x. Note that even after extending the training, it takes less FLOPs to train sparse networks using RigL (except for the 80% sparse RigL-ERK) compared to the pruning method.
RigL, our method with constant sparsity distribution, exceeds the performance of magnitude based iterative pruning in all sparsity levels while requiring less FLOPs to train. Sparse networks that use Erdős-Renyi-Kernel (ERK) sparsity distribution obtains even greater performance. For example ResNet-50 with 96.5% sparsity achieves a remarkable 72.75% Top-1 Accuracy, around 3.5% higher than the extended magnitude pruning results reported by Gale et al. (2019). As observed earlier, smaller dense models (with the same number of parameters) or sparse models with a static connectivity can not perform at a comparable level.
A more fine grained comparison of sparse training methods is presented in Table 2. Methods using uniform sparsity distribution and whose FLOP/memory footprint scales directly with (1-S) are placed in the first sub-group of the table. The second sub-group includes DSR and networks with ERK sparsity distribution which require a higher number of FLOPs for inference with same parameter count. The final sub-group includes methods that require the space and the work proportional to training a dense model.
MobileNet is a compact architecture that performs remarkably well in resource constrained settings. Due to its compact nature with separable convolutions it is known to be difficult to sparsify (Zhu and Gupta, 2018). In this section we apply our method to MobileNet-v1 (Howard et al., 2017) and MobileNet-v2 (Sandler et al., 2018). Due to its low parameter count we keep the first layer dense, and use ERK and Uniform sparsity distributions to sparsify the remaining layers.
The performance of sparse MobileNets trained with RigL as well as the baselines are shown in Figure 3. We do extended training (5x of the original number of steps) for all runs in this section. Although MobileNets are more sensitive to sparsity compared to the ResNet-50 architecture, RigL successfully trains sparse MobileNets at high sparsities and exceeds the performance of previously reported pruning results.
To demonstrate the advantages of sparse models, next, we train wider MobileNets while keeping the FLOPs and total number of parameters the same as the dense baseline using sparsity. A sparse MobileNet-v1 with width multiplier 1.98 and constant 75% sparsity has the same FLOPs and parameter count as the dense baseline. Training this network with RigL yields an impressive 4.3% absolute improvement in Top-1 Accuracy.
4.2 Character Level Language Modelling
Most prior work has only examined sparse training on vision networks (the exception is the earliest work - Deep Rewiring (Bellec et al., 2018) which trained an LSTM (Hochreiter and Schmidhuber, 1997) on the TIMIT (Garofolo et al., 1993) dataset). To fully understand these techniques it is important to examine different architectures on different datasets. Kalchbrenner et al. (2018) found sparse GRUs (Cho et al., 2014) to be very effective at modeling speech, however the dataset they used is not available. We choose a proxy task with similar characteristics (dataset size and vocabulary size are approximately the same) - character level language modeling on the publicly available WikiText-103 (Merity et al., 2016) dataset.
Our network consists of a shared embedding with dimensionality 128, a vocabulary size of 256, a GRU with a state size of 512, a readout from the GRU state consisting of two linear layers with 256 units and 128 units respectively. We train the next step prediction task with the standard cross entropy loss, the Adam optimizer, a learning rate of , an L2 regularization coefficient of , a sequence length of 512, a batch size of 32 and gradient absolute value clipping of values larger (in magnitude) than 10. We set the sparsity to 75% for all models and run 200,000 iterations. When inducing sparsity with magnitude pruning (Zhu and Gupta, 2018), we perform pruning between iterations 50,000 and 150,000 with a frequency of 1,000. We initialize sparse networks with a uniform sparsity distribution and use a cosine update schedule with and . Unlike the previous experiments we keep updating the mask until the end of the training; we observed this performed slightly better than stopping at iteration 150,000.
In Figure 4-left we report the validation bits per step of various solutions at the end of the training. For each method we perform extended runs to see how they scale with increasing training time. As observed before, SET performs worse than the other dynamic training methods and its performance improves only slightly with increased training time. On the other hand the performance of RigL and SNFS improves constantly with more training steps. Even though RigL exceeds the performance of the other sparse training approaches it fails to match the performance of pruning in this setting.
4.3 WideResNet-22-2 on CIFAR-10
We also evaluate the performance of RigL on the CIFAR-10 image classification benchmark. We train a Wide Residual Network (Zagoruyko and Komodakis, 2016) with 22 layers using a width multiplier of 2 for 250 epochs (97656 steps). The learning rate starts at 0.1 which is scaled down by a factor of 5 every 30,000 iterations. We use an L2 regularization coefficient of 5e-4, a batch size of 128 and a momentum coefficient of 0.9. We keep the hyper-parameters specific to RigL same as the ImageNet experiments, except the final iteration for mask updates; which is adjusted to 75000. Results with different mask update intervals can be found in Appendix I.
The final accuracy of RigL for various sparsity levels is presented in Figure 4-right. The dense baseline obtains 94.1% test accuracy; surprisingly, some of the 50% sparse networks generalize better than the dense baseline demonstrating the regularization aspect of sparsity. With increased sparsity, we see a performance gap between the Static and Pruning solutions. Training static networks longer seems to have limited effect on the final performance. On the other hand, RigL matches the performance of pruning with only a fraction of the resources needed for training.
4.4 Analyzing the performance of RigL
In this section we study the effect of sparsity distributions, update schedules, and dynamic connections on the performance of our method. The results for SET and SNFS are similar and are discussed in Appendices C and F.
Effect of Mask Initialization: Figure 5-left shows how the sparsity distribution affects the final test accuracy of sparse ResNet-50s trained with RigL. Erdős-Rényi-Kernel (ERK) performs consistently better than the other two distributions. ERK automatically allocates more parameters to the layers with few parameters by decreasing their sparsities†† † see Appendix J for exact layer-wise sparsities given by ERK for 80% and 90% sparse ResNet-50’s.. This reallocation seems to be crucial for preserving the capacity of the network at high sparsity levels where ERK outperforms other distributions by a greater margin. Though it performs better, the ERK distribution requires approximately twice as many FLOPs compared to a uniform distribution. This highlights an interesting trade-off between accuracy and computational efficiency even though both models have the same number of parameters.
Effect of Update Schedule and Frequency: In Figure 5-right, we evaluate the performance of our method on update intervals and initial drop fractions . The best accuracies are obtained when the mask is updated every 100 iterations with an initial drop fraction of 0.3 or 0.5. Notably, even with frequent update intervals (e.g. every 1000 iterations), RigL performs above 73.5%.
Effect of Dynamic connections: Frankle et al. (2019) and Mostafa and Wang (2019) observed that static sparse training converges to a solution with a higher loss than dynamic sparse training. In Figure 6-left we examine the loss landscape lying between a solution found via static sparse training and a solution found via pruning to understand whether the former lies in a basin isolated from the latter. Performing a linear interpolation between the two reveals the expected result – a high-loss barrier – demonstrating that the loss landscape is not trivially connected. However, this is only one of infinitely many paths between the two points; optimization can be used to find parametric curves that connect solutions (Garipov et al., 2018; Draxler et al., 2018) subject to constraints, such as minimizing the loss along the curve. For example Garipov et al. (2018) showed different dense solutions lie in the same basin by finding order Bézier curves with low energy between the two solutions. Following their method, we attempt to find quadratic and cubic Bézier curves between the two sparse solutions. Surprisingly, even with a cubic curve, we fail to find a path without a high-loss barrier. These results suggest that static sparse training can get stuck at local minima that are isolated from improved solutions. On the other hand, when we optimize the quadratic Bézier curve across the full dense space we find a near-monotonic path to the improved solution, suggesting that allowing new connections to grow lends dynamic sparse training greater flexibility in navigating the loss landscape.
In Figure 6-right we train RigL starting from the sub-optimal solution found by static sparse training, demonstrating that it is able to escape the local minimum, whereas re-training with static sparse training cannot. RigL first removes connections with the smallest magnitudes since removing these connections have been shown to have a minimal effect on the loss (Han et al., 2015; Evci, 2018). Next, it activates connections with the high gradients, since these connections are expected to decrease the loss fastest. We hypothesize in Appendix A that RigL escapes bad critical points by replacing saddle directions with high gradient dimensions.
5 Discussion & Conclusion
In this work we introduced ‘Rigged Lottery’ or RigL, an algorithm for training sparse neural networks efficiently. For a given computational budget RigL achieves higher accuracies than existing dense-to-sparse and sparse-to-sparse training algorithms. RigL is useful in three different scenarios: (1) To improve the accuracy of sparse models intended for deployment; (2) To improve the accuracy of large sparse models which can only be trained for a limited number of iterations; (3) Combined with sparse primitives to enable training of extremely large sparse models which otherwise would not be possible.
The third scenario is unexplored due to the lack of hardware and software support for sparsity. Nonetheless, work continues to improve the performance of sparse networks on current hardware (Hong et al., 2019; Merrill and Garland, 2016), and new types of hardware accelerators will have better support for parameter sparsity (Wang et al., 2018; Ashby, 2019; Liu et al., 2018; Han et al., 2016a; Chen et al., 2019). RigL provides the tools to take advantage of, and motivation for, such advances.
- TensorFlow: large-scale machine learning on heterogeneous systems. External Links: Cited by: §4.
- Exploiting unstructured sparsity on next-generation datacenter hardware. External Links: Cited by: §5.
- Deep rewiring: training very sparse deep networks. In International Conference on Learning Representations, Cited by: §2, §4.2.
- Eyeriss v2: a flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9 (2), pp. 292–308. External Links: Cited by: §5.
- Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), (English (US)). Cited by: §4.2.
- Learning sparse neural networks through regularization. In International Conference on Learning Representations, Cited by: Appendix B, §2.
- Compressing neural networks using the variational information bottleneck. In International Conference on Machine Learning, Cited by: Appendix B, Appendix B.
- Sparse networks from scratch: faster training without losing performance. ArXiv. External Links: Cited by: Appendix G, 1st item, §2, §3.
- \hrefhttp://proceedings.mlr.press/v80/draxler18a/draxler18a.pdfEssentially No Barriers in Neural Network Energy Landscape. In International Conference on Machine Learning, Cited by: §4.4.
- Fast sparse convnets. External Links: Cited by: §1.
- The difficulty of training sparse neural networks. ArXiv. External Links: Cited by: §4.1.1.
- Detecting dead weights and units in neural networks. CoRR. External Links: Cited by: Appendix A, §4.4.
- The lottery ticket hypothesis: finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Cited by: §1, §2.
- The lottery ticket hypothesis at scale. ArXiv. External Links: Cited by: §2, §4.1.1, §4.4.
- The state of sparsity in deep neural networks. CoRR abs/1902.09574. External Links: Cited by: §1, §2, §4.1.1, §4.1.1, §4.1.1.
- Loss surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural Information Processing Systems, Cited by: §4.4.
- DARPA timit acoustic phonetic continuous speech corpus cdrom. NIST. Cited by: §4.2.
- Dynamic network surgery for efficient DNNs. CoRR abs/1608.04493. External Links: Cited by: Rigging the Lottery: Making All Tickets Winners, §1, §2.
- EIE: efficient Inference Engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, Cited by: §5.
- Deep compression: compressing deep neural network with pruning, trained quantization and huffman coding. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, External Links: Cited by: §2.
- Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, Cited by: Appendix A, §1, §4.4.
- Second order derivatives for network pruning: Optimal Brain Surgeon. Cited by: §2.
- Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
- Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: Cited by: §4.2.
- Adaptive sparse tiling for sparse matrix multiplication. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, PPoPP ’19, New York, NY, USA, pp. 300–314. External Links: Cited by: §5.
- MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861. External Links: Cited by: §4.1.2.
- Efficient neural audio synthesis. In International Conference on Machine Learning (ICML), Cited by: §1, §4.2.
-  () CIFAR-10 (canadian institute for advanced research). . External Links: Cited by: §4.
- Optimal Brain Damage. In Advances in Neural Information Processing Systems, Cited by: §2.
- SNIP: Single-shot Network Pruning based on Connection Sensitivity. In International Conference on Learning Representations (ICLR), 2019, Cited by: §2.
- Pruning filters for efficient convnets. In International Conference on Learning Representations, Cited by: Rigging the Lottery: Making All Tickets Winners.
- Memory-efficient deep learning on a spinnaker 2 prototype. In Front. Neurosci., Cited by: §5.
- Rethinking the value of network pruning. In International Conference on Learning Representations, Cited by: Appendix B, Appendix B.
- Pointer sentinel mixture models. ArXiv. External Links: Cited by: §4.2, §4.
- Merge-based sparse matrix-vector multiplication (spmv) using the csr storage format. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’16, New York, NY, USA, pp. 43:1–43:2. External Links: Cited by: §5.
- \hrefhttp://www.nature.com/articles/s41467-018-04316-3Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications. Cited by: §2, item 2.
- Variational Dropout Sparsifies Deep Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 2498–2507. Cited by: Rigging the Lottery: Making All Tickets Winners, §1, §2.
- Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning. CoRR abs/1611.06440. Cited by: Appendix A.
- Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 4646–4655. External Links: Cited by: §2, §4.1.1, §4.4.
- Exploring sparsity in recurrent neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Cited by: Appendix A, Rigging the Lottery: Making All Tickets Winners, §2.
- Structured bayesian pruning via log-normal multiplicative noise. In International Conference on Neural Information Processing Systems, Cited by: Appendix B.
- Holistic SparseCNN: forging the trident of accuracy, speed, and size. CoRR abs/1608.01409. External Links: Cited by: §1.
- ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV). Cited by: §4.
- MobileNetV2: inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vol. , pp. 4510–4520. External Links: Cited by: §4.1.2.
- Training sparse neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vol. , pp. 455–462. External Links: Cited by: §1.
- Sparse Connection and Pruning in Large Dynamic Artificial Neural Networks. In EUROSPEECH, Cited by: Appendix A, §2.
- Rethinking the inception architecture for computer vision. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition,, External Links: Cited by: §4.1.
- Learning sparse neural networks via sensitivity-driven regularization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, USA, pp. 3882–3892. External Links: Cited by: §2.
- Evaluating pruning methods. In International Symposium on Artificial Neural Networks, pp. 2. Cited by: Appendix A, §2.
- SNrram: an efficient sparse neural network computation architecture based on resistive random-access memory. In Proceedings of the 55th Annual Design Automation Conference, DAC ’18, New York, NY, USA, pp. 106:1–106:6. External Links: Cited by: §5.
- Wide residual networks. In BMVC, Cited by: §4.3.
- \hrefhttp://arxiv.org/abs/1905.01067Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask. ArXiv. Cited by: §2.
- \hrefhttps://arxiv.org/abs/1710.01878To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression. In International Conference on Learning Representations Workshop, Cited by: item 2, Rigging the Lottery: Making All Tickets Winners, §1, §2, Figure 3, §4.1.2, §4.2, §4.
Appendix A Effect of Mask Updates on the Energy Landscape
To update the connectivity of our sparse network, we first need to drop a fraction of the existing connections for each layer independently to create a budget for growing new connections. Like many prior works (Thimm and Fiesler, 1995; Ström, 1997; Narang et al., 2017; Han et al., 2015), we drop parameters with the smallest magnitude. The effectiveness of this simple criteria can be explained through the first order Taylor approximation of the loss around the current set of parameters .
The main goal of dropping connections is to remove parameters with minimal impact on the output of the neural network and therefore on its loss. Since removing the connection corresponds to setting it to zero, it incurs a change of in that direction and a change of in the loss, where the first term is usually defined as the saliency of a connection. Saliency has been used as a criterion to remove connections (Molchanov et al., 2016), however it has been shown to produce inferior results compared to magnitude based removal, especially when used to remove multiple connections at once (Evci, 2018). In contrast, picking the lowest magnitude connections ensures a small remainder term in addition to a low saliency, limiting the damage we make when we drop connections. Additionally, we note that connections with small magnitude can only remain small if the gradient is also small, meaning that the saliency is likely small when the parameter itself is small.
After the removal of insignificant connections, we enable new connections that have the highest expected gradients. Since we initialize these new connections to zero, they are guaranteed to have high gradients in the proceeding iteration and therefore to reduce the loss quickly. Combining this observation with the previous one (RigL is likely to remove low gradient directions) and the results in Section 4.4 we suggest that RigL improves the energy landscape of the optimization by replacing flat dimensions with the ones with high gradient. This helps the optimization procedure escape saddle points.
Appendix B Comparison with Bayesian Structured Pruning Algorithms
Structured pruning algorithms aim to remove entire neurons (or channels) instead of individual connections either at the end of, or throughout training. The final pruned network is a smaller dense network. Liu et al. (2019) demonstrated that these smaller networks could themselves be successfully be trained from scratch. This recasts structured pruning approaches as a limited kind of architecture search, where the search space is the size of each hidden layer.
In this section we compare RigL with three different structured pruning algorithms: SBP (Neklyudov et al., 2017), L0 (Christos Louizos, 2018), and VIB (Dai et al., 2018). We show that starting from a random sparse network, RigL finds compact networks with fewer parameters, that require fewer FLOPs for inference and require fewer resources for training. This serves as a general demonstration of the effectiveness of unstructured sparsity.
|Method||Final||Sparsity||Training Cost||Inference Cost||Size||Error|
For our setting we pick the standard LeNet 300-100 network with ReLU non-linearities trained on MNIST. In Table 3 we compare methods based on how many FLOPs they require for training and also how efficient the final architecture is. Unfortunately, none of the papers have released the code for reproducing the MLP results, so we therefore use the reported accuracies and calculate lower bounds for the the FLOPs used during training. For each method we assume that one training step takes as much as the dense 300-100 architecture and omit any additional operations each method introduces. We also consider training the pruned networks from scratch and report the training FLOPs required in parenthesis. In this setting, training FLOPs are significantly lower since the starting networks are have been significantly reduced in size. We assume that following (Liu et al., 2019) the final networks can be trained from scratch, but we cannot verify this for these MLP networks since it would require knowledge of which pixels were dropped from the input.
To compare, we train a sparse network starting from the original MLP architecture (RigL). At initialization, we randomly remove 99% and 89% of the connections in the first and second layer of the MLP. At the end of the training many of the neurons in the first 2 layers have no in-coming or out-going connections. We remove such neurons and use the resulting architecture to calculate the inference FLOPs and the size. We assume the sparse connectivity is stored as a bit-mask (We assume parameters are represented as floats, i.e. 4 bytes). In this setting, RigL finds smaller, more FLOP efficient networks with far less work than the Bayesian approaches.
Next, we train a sparse network starting from the architecture found by the first run (RigL+) (408-100-69) but with a new random initialization (both masks and the parameters). We reduce the sparsity of the first 2 layers to 96% and 86% respectively as the network is already much smaller. Repeating RigL training results in an even more compact architecture half the size and requiring only a third the FLOPs of the best architecture found by Dai et al. (2018).
Examination of the open-sourced code for the methods considered here made us aware that all of them keep track of the test error during training and report the best error ever observed during training as the final error. We generally would not encourage such overfitting to the test/validation set, however to make the comparisons with these results fair we report both the lowest error observed during training and the error at the end of training (reported in parenthesis). All hyper-parameter tuning was done using only the final test error.
In Figure 7 we visualize how RigL chooses to connect to the input and how this evolves from the beginning to the end of training. The heatmap shows the number of outgoing connections from each input pixels at the beginning (RigL Initial) and at the end (RigL (Final)) of the training. The left two images are for the initial network and the right two images are for RigL+ training. RigL automatically discards uninformative pixels and allocates the connections towards the center highlighting the potential of RigL on model compression and feature selection.
Appendix C Effect of Sparsity Distribution on Other Methods
In Figure 8-left we show the effect of sparsity distribution choice on 4 different sparse training methods. ERK distribution performs better than other distributions for each training method.
Appendix D Effect of Momentum Coefficient for SNFS
In Figure 8 right we show the effect of the momentum coefficient on the performance of SNFS. Our results shows that using a coefficient of 0.99 brings the best performance. On the other hand using the most recent gradient only (coefficient of 0) performs as good as using a coefficient of 0.9. This result might be due to the large batch size we are using (4096), but it still motivates using RigL and instantaneous gradient information only when needed, instead of accumulating them.
Appendix E (Non)-Existence of Lottery Tickets
We perform the following experiment to see whether Lottery Tickets exist in our setting. We take the sparse network found by RigL and restart training using original initialization, both with RigL and with fixed topology as in the original Lottery Ticket Hypothesis. Results in table 4 demonstrate that training with a fixed topology is significantly worse than training with RigL and that RigL does not benefit from starting again with the final topology and the original initialization - training for twice as long instead of rewiring is more effective. In short, there are no special tickets, with RigL all tickets seems to win.
|Initialization||Training Method||Test Accuracy||Training FLOPs|
Appendix F Effect of Update Schedules on Other Dynamic Sparse Methods
In Figure 9 we repeat the hyper-parameter sweep done for RigL in Figure 5-right, using SET and SNFS. Cosine schedule with and seems to work best across all methods. An interesting observation is that higher drop fractions () seem to work better with longer intervals . For example, SET with seems to work best with .
Appendix G Alternative Update Schedules
In Figure 10, we share the performance of two alternative annealing functions:
Inverse Power: The fraction of weights updated decreases similarly to the schedule used in Zhu and Gupta (2018) for iterative pruning: . In our experiments we tried which is the linear decay and their default .
Constant seems to perform well with low initial drop fractions like , but it starts to perform worse with increasing . Inverse Power for k=3 and k=1 (Linear) seems to perform similarly for low values. However the performance drops noticeably for k=3 when we increase the update interval. As reported by Dettmers and Zettlemoyer (2019) linear (k=1) seems to provide similar results as the cosine schedule.
Appendix H Calculating FLOPs of models and methods
In order to calculate FLOPs needed for a single forward pass of a sparse model, we count the total number of multiplications and additions layer by layer for a given layer sparsity . The total FLOPs is then obtained by summing up all of these multiply and adds.
Different sparsity distributions require different number of FLOPs to compute a single prediction. For example Erdős-Renyi-Kernel distributions usually cause earlier layers to be less sparse than the later layers (see Appendix J). The inputs of earlier layers have greater spatial dimensions, so a convolutional kernel that works on such inputs will require more FLOPs to compute the output features compared to later layers. Thus, having earlier layers which are less sparse results in a higher total number of FLOPs required by a model.
Training a neural network consists of 2 main steps:
forward pass: Calculating the loss of the current set of parameters on a given batch of data. During this process layer activations are calculated in sequence using the previous activations and the parameters of the layer. Activation of layers are stored in memory for the backward pass.
backward pass: Using the loss value as the initial error signal, we back-propagate the error signal while calculating the gradient of parameters. During the backward pass each layer calculates 2 quantities: the gradient of the activations of the previous layer and the gradient of its parameters. Therefore in our calculations we count backward passes as two times the computational expense of the forward pass. We omit the FLOPs needed for batch normalization and cross entropy.
Dynamic sparse training methods require some extra FLOPs to update the connectivity of the neural network. We omit FLOPs needed for dropping the lowest magnitude connections in our calculations. For a given dense architecture with FLOPs and a sparse version with FLOPs , the total FLOPs required to calculate the gradient on a single sample is computed as follows:
Static Sparse and Dense. Scales with and FLOPs, respectively.
Snip. We omit the initial dense gradient calculation since it is negligible, which means Snip scales in the same way as Static methods: FLOPs.
SET. We omit the extra FLOPs needed for growing random connections, since this operation can be done on chip efficiently. Therefore, the total FLOPs for SET scales with .
SNFS. Forward pass and back-propagating the error signal needs FLOPs. However, the dense gradient needs to be calculated at every iteration. Thus, the total number of FLOPs scales with .
RigL. Iterations with no connection updates need FLOPs. However, at every iteration we need to calculate the dense gradients. This results in the average FLOPs for RigL given by .
Appendix I Additional Plots and Experiments for CIFAR-10
In Figure 11-left, we plot the final training loss of experiments presented in Section 4.3 to investigate the generalization properties of the algorithms considered. Poor performance of Static reflects itself in training loss clearly across all sparsity levels. RigL achieves similar final loss as the pruning, despite having around half percent less accuracy. Training longer with RigL decreases the final loss further and the test accuracies start matching pruning (see Figure 4-right) performance. These results show that RigL improves the optimization as promised, however generalizes slightly worse than pruning.
In Figure 11-right, we sweep mask update interval and plot the final test accuracies. We fix initial drop fraction to and evaluate two different sparsity distributions: Uniform and ERK. Both curves follow a similar pattern as in Imagenet-2012 sweeps (see Figure 9) and best results are obtained when .
Appendix J Sparsity of Individual Layers for Sparse ResNet-50
Sparsity of ResNet-50 layers given by the Erdős-Rényi-Kernel sparsity distribution plotted in Figure 12.
Appendix K Bugs Discovered During Experiments
Our initial implementations contained some subtle bugs, which while not affecting the general conclusion that RigL is more effective than other techniques, did result in lower accuracy for all sparse training techniques. We detail these issues here with the hope that others may learn from our mistakes.
Random operations on multiple replicas. We use data parallelism to split a mini-batch among multiple replicas. Each replica independently calculates the gradients using a different sub-mini-batch of data. The gradients are aggregated using an all-reduce operation before the optimizer update. Our implementation of SET, SNFS and RigL depended on each replica independently choosing to drop and grow the same connections. However, due to the nature of random operations in Tensorflow, this did not happen. Instead, different replicas diverged after the first drop/grow step. This was most pronounced in SET where each replica chose at random and much less so for SNFS and RigL where randomness is only needed to break ties. If left unchecked this might be expected to be catastrophic, but due to the behavior of Estimators and/or TF-replicator, the values on the first replica are broadcast to the others periodically (every approximately 1000 steps in our case).
Synchronization between replicas. RigL and SNFS depend on calculating dense gradients with respect to the masked parameters. However, as explained above, in the multiple replica setting these gradients need to be aggregated. Normally this aggregation is automatically done by the optimizer, but in our case, this does not happen (only the gradients with respect to the unmasked parameters are aggregated automatically). This bug affected SNFS and RigL, but not SET since SET does not rely on the gradients to grow connections. Again, the synchronization of the parameters from the first replica every approximately 1000 steps masked this bug.
We fixed this bug by explicitly calling all-reduce on the gradients with respect to the masked parameters. With this fix, the performance of RigL and SNFS improved significantly, particularly for default training lengths (around 0.5-1% improvement).
SNIP Experiments. Our first implementation of SNIP used the gradient magnitudes to decide which connections to keep causing its performance to be worse than static. Upon our discussions with the authors of SNIP, we realized that the correct metric is the saliency (gradient times parameter magnitude). With this correction SNIP performance improved dramatically to better than random (Static) even at Resnet-50/ImageNet scale. It is surprising that picking connections with the highest gradient magnitudes can be so detrimental to training (it resulted in much worse than random performance).