Sparse Networks from Scratch: Faster Training without Losing Performance

  • 2019-07-10 17:40:20
  • Tim Dettmers, Luke Zettlemoyer


We demonstrate the possibility of what we call sparse learning: accelerated training of deep neural networks that maintain sparse weights throughout training while achieving performance levels competitive with dense networks. We accomplish this by developing sparse momentum, an algorithm which uses exponentially smoothed gradients (momentum) to identify layers and weights which reduce the error efficiently. Sparse momentum redistributes pruned weights across layers according to the mean momentum magnitude of each layer. Within a layer, sparse momentum grows weights according to the momentum magnitude of zero-valued weights. We demonstrate state-of-the-art sparse performance on MNIST, CIFAR-10, and ImageNet, decreasing the mean error by a relative 8%, 15%, and 6% compared to other sparse algorithms. Furthermore, we show that our algorithm can reliably find the equivalent of winning lottery tickets from random initialization: Our algorithm finds sparse configurations with 20% or fewer weights which perform as well, or better than their dense counterparts. Sparse momentum also decreases the training time: It requires a single training run -- no re-training is required -- and increases training speed up to 11.85x. In our analysis, we show that our sparse networks might be able to reach dense performance levels by learning more general features which are useful to a broader range of classes than dense networks.



Sparse Networks from Scratch:
Faster Training without Losing Performance

Tim Dettmers & Luke Zettlemoyer
University of Washington
{dettmers, lsz}


Preprint. Under review.

1 Introduction

Current state-of-the-art neural networks need extensive computational resources to be trained and can have capacities of close to one billion connections between neurons (Vaswani et al., 2017; Devlin et al., 2018; Child et al., 2019). One solution that nature found to improve neural network scaling is to use sparsity: the more neurons a brain has, the fewer connections neurons make with each other (Herculano-Houzel et al., 2010). Similarly, for deep neural networks, it has been shown that sparse weight configurations exist which train faster and achieve the same errors as dense networks (Frankle and Carbin, 2019). However, currently, these sparse configurations are found by starting from a dense network, which is pruned and re-trained repeatedly – an expensive procedure.

In this work, we demonstrate the possibility of training sparse networks that rival the performance of their dense counterparts with a single training run – no re-training is required. We train from random initializations and maintain sparse weights throughout training while also speeding up the overall training time. We achieve this by developing sparse momentum, an algorithm which uses the exponentially smoothed gradient of network weights (momentum) as a measure of persistent errors to identify which layers are most efficient at reducing the error and which missing connections between neurons would reduce the error the most. Sparse momentum follows a cycle of (1) pruning weights with small magnitude, (2) redistributing weights across layers according to the mean momentum magnitude of existing weights, and (3) growing new weights to fill in missing connections which have the highest momentum magnitude.

We compare the performance of sparse momentum to compression algorithms and recent methods that maintain sparse weights throughout training. We demonstrate state-of-the-art sparse performance on MNIST, CIFAR-10, and ImageNet-2012. Sparse momentum also matches the performance of several dense baselines on MNIST and CIFAR-10. We estimate mean speedups of our sparse convolutional networks on CIFAR-10 for optimal sparse convolution algorithms and naive dense convolution algorithms compared to dense baselines. For sparse convolution, we estimate speedups between 3.50x and 11.85x and for dense convolution speedups between 1.16x and 1.45x. Finally, we present an analysis of the feature representations of our sparse networks. We find that networks trained by sparse momentum learn more general features which are useful to a broader range of classes than dense features which might explain why sparse networks can compete with dense networks.

2 Related Work

From Dense to Sparse Neural Networks: Work that focuses on creating sparse from dense neural networks has an extensive history. Earlier work focused on pruning via second-order derivatives (LeCun et al., 1989; Karnin, 1990; Hassibi and Stork, 1992) and heuristics which ensure efficient training of networks after pruning (Chauvin, 1988; Mozer and Smolensky, 1988; Ishikawa, 1996). Recent work is often motivated by the memory and computational benefits of sparse models that enable the deployment of deep neural networks on mobile and low-energy devices. A very influential paradigm has been the iterative (1) train-dense, (2) prune, (3) re-train cycle introduced by Han et al. (2015). Extensions to this work include: compressing recurrent neural networks and other models (Narang et al., 2017; Zhu and Gupta, 2018; Dai et al., 2018), continuous pruning and re-training (Guo et al., 2016), joint loss/pruning-cost optimization (Carreira-Perpinán and Idelbayev, 2018), layer-by-layer pruning (Dong et al., 2017), fast-switching growth-pruning cycles (Dai et al., 2017), and soft weight-sharing (Ullrich et al., 2017). These approaches often involve re-training phases which increase the training time. However, since the main goal of this line of work is a compressed model for mobile devices, reducing the run-time of these procedures is desirable but not a main goal. This is contrary to our motivation. Despite the difference in motivation, we include many of these dense-to-sparse compression methods in our comparisons. Other compression algorithms include L0 regularization (Louizos et al., 2018) and Bayesian methods (Louizos et al., 2017; Molchanov et al., 2017). For further details, see the survey of Gale et al. (2019).

Interpretation and Analysis of Sparse Neural Networks: Frankle and Carbin (2019) show that "winning lottery tickets" exist for deep neural networks – sparse initializations which reach similar predictive performance as dense networks and train just as fast. However, finding these winning lottery tickets is computationally expensive and involves multiple prune and re-train cycles starting from a dense network. Followup work concentrated on finding these configurations faster (Frankle et al., 2019; Zhou et al., 2019). In contrast, we reach dense performance levels with a sparse network from random initialization with a single training run while accelerating training.

Sparse Neural Networks Throughout Training: Methods that maintain sparse weights throughout training through a prune-redistribute-regrowth cycle are most closely related to our work. Bellec et al. (2018) introduce DEEP-R, which takes a Bayesian perspective and performs sampling for prune and regrowth decisions – sampling sparse network configurations from a posterior. While theoretically rigorous, this approach is computationally expensive and challenging to apply to large networks and datasets. Sparse evolutionary training (SET) (Mocanu et al., 2018) simplifies prune-regrowth cycles by using heuristics: (1) prune the smallest and most negative weights, (2) grow new weights in random locations. Unlike our work, where many convolutional channels are empty and can be excluded from computation, growing weights randomly fills most convolutional channels and makes it challenging to harness computational speedups during training without specialized sparse algorithms. SET also does not include the cross-layer redistribution of weights which we find to be critical for good performance, as shown in our ablation study. The most closely related work to ours is Dynamic Sparse Reparameterization (DSR) by Mostafa and Wang (2019), which includes the full prune-redistribute-regrowth cycle. However, similar to SET, DSR includes random regrowth which hampers the possibilities of speedups during training. More distantly related is Single-shot Network Pruning (SNIP) (Lee et al., 2019), which aims to find the best sparse network from a single pruning decision. The goal of SNIP is simplicity, while our goal is maximizing predictive and run-time performance. In our work, we compare against all four methods: DEEP-R, SET, DSR, and SNIP.

3 Method

3.1 Sparse Learning

We define sparse learning to be the training of deep neural networks which maintain sparsity throughout training while matching the predictive performance of dense neural networks. To achieve this, intuitively, we want to find the weights that reduce the error most effectively. This is challenging since most deep neural networks can hold trillions of different combinations of sparse weights. Additionally, during training, as feature hierarchies are learned, efficient weights might change gradually from shallow to deep layers. How can we find good sparse configurations? In this work, we follow a divide-and-conquer strategy that is guided by computationally efficient heuristics. We divide sparse learning into the following sub-problems which can be tackled independently: (1) pruning weights, (2) redistribution of weights across layers, and (3) regrowing weights, as defined in more detail below.

Figure 1: Sparse Momentum is applied at the end of each epoch: (1) take the magnitude of the exponentially smoothed gradient (momentum) of each layer and normalize to 1; (2) for each layer, remove p=50% of the weights with the smallest magnitude; (3) across layers, redistribute the removed weights by adding weights to each layer proportionate to the momentum of each layer; within a layer, add weights starting from those with the largest momentum magnitude. Decay p.

3.2 Sparse Momentum

We use the mean magnitude of momentum 𝐌_i of existing weights 𝐖_i in each layer i to estimate how efficient the average weight in each layer is at reducing the overall error. Intuitively, we want to take weights from less efficient layers and redistribute them to weight-efficient layers. The sparse momentum algorithm is depicted in Figure 1. In this section, we first describe the intuition behind sparse momentum and then present a more detailed description of the algorithm.

The gradient of the error with respect to a weight, ∂E/∂𝐖, yields the direction which reduces the error at the highest rate. However, if we use stochastic gradient descent, most entries of ∂E/∂𝐖 oscillate between small/large and negative/positive gradients with each mini-batch (Qian, 1999) – a good change for one mini-batch might be a bad change for another. We can reduce oscillations if we take the average gradient over time, thereby finding weights which reduce the error consistently. However, we want to value recent gradients, which are closer to the local minimum, more highly than the distant past. This can be achieved by exponentially smoothing ∂E/∂𝐖 – the momentum 𝐌_i:

$$\mathbf{M}_i^{t+1} = \alpha \mathbf{M}_i^{t} + (1 - \alpha) \frac{\partial E}{\partial \mathbf{W}_i}$$

where α is a smoothing factor, 𝐌_i is the momentum for the weight 𝐖_i in layer i; 𝐌_i is initialized with 𝟎.
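As a rough illustration of why smoothing separates consistent from oscillating gradients, the sketch below applies the exponential smoothing update M ← αM + (1 − α)g to two made-up gradient sequences. The function name `smooth_gradients`, the value α = 0.9, and the gradient values are assumptions chosen for illustration, not settings taken from the experiments.

```python
def smooth_gradients(gradients, alpha=0.9):
    """Exponentially smooth the per-mini-batch gradients of a single weight.

    Assumes the update M <- alpha * M + (1 - alpha) * g, with M starting at 0.
    Returns the momentum value after every mini-batch.
    """
    momentum = 0.0
    history = []
    for g in gradients:
        momentum = alpha * momentum + (1.0 - alpha) * g
        history.append(momentum)
    return history


# Hypothetical gradients: one weight flips sign every mini-batch,
# the other receives a small but steady gradient.
oscillating = smooth_gradients([1.0, -1.0] * 10)
consistent = smooth_gradients([0.2] * 20)

# The steady weight ends with the larger momentum magnitude even though
# its raw gradients are five times smaller.
print(abs(oscillating[-1]), abs(consistent[-1]))
```

Under these assumptions the oscillating weight's momentum largely cancels out, which is the property the method relies on when ranking weights and layers.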

Momentum is efficient at accelerating the optimization of deep neural networks by identifying weights which reduce the error consistently. Similarly, the aggregated momentum of weights in each layer should reflect how good each layer is at reducing the error consistently. Additionally, the momentum of zero-valued weights – equivalent to missing weights in sparse networks – can be used to estimate how quickly the error would change if these weights were included in a sparse network.

The details of the algorithm are shown in Algorithm 1. Before training, we initialize the network with a certain sparsity s: we initialize the network as usual and then remove a fraction s of the weights in each layer. During training, we apply sparse momentum after each epoch. We can break the sparse momentum algorithm itself into three major parts: (a) redistribution of weights, (b) pruning weights, and (c) regrowing weights. In step (a), we calculate the weight redistribution proportions and in turn how many weights to regrow in each layer: for each layer, we take the mean of the element-wise momentum magnitude that belongs to all nonzero weights. We then sum-normalize these means across all layers to get the momentum contribution of each layer. Finally, we take this momentum contribution for each layer and multiply it by the overall number of removed weights to get the number of weights which we will regrow in each layer. In step (b), we prune a proportion p (the pruning rate) of the weights with the lowest magnitude in each layer. In step (c), we regrow weights by enabling the gradient flow of zero-valued (missing) weights which have the largest momentum magnitude.
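Step (a) can be sketched in a few lines. The example below is a minimal illustration, assuming the per-layer mean momentum magnitudes have already been computed; the function name `regrowth_counts` and all input numbers are hypothetical.

```python
import math


def regrowth_counts(mean_momentum, nonzero_counts, prune_rate):
    """Return (contributions, counts) for the redistribution step.

    mean_momentum[i]:  mean |momentum| over the nonzero weights of layer i
    nonzero_counts[i]: number of nonzero weights in layer i
    """
    # Sum-normalize the per-layer means so that they add up to 1.
    total_momentum = sum(mean_momentum)
    contributions = [m / total_momentum for m in mean_momentum]
    # Overall number of weights removed when every layer is pruned by prune_rate.
    total_removed = sum(math.floor(n * prune_rate) for n in nonzero_counts)
    # Each layer regrows its share of the removed weights.
    counts = [math.floor(c * total_removed) for c in contributions]
    return contributions, counts


# Hypothetical three-layer network.
contributions, counts = regrowth_counts(
    mean_momentum=[0.125, 0.25, 0.625],
    nonzero_counts=[1000, 2000, 1000],
    prune_rate=0.5,
)
print(contributions)  # [0.125, 0.25, 0.625]
print(counts)         # [250, 500, 1250]
```

In this example the second layer loses 1000 weights to pruning but regrows only 500, while the third layer loses 500 and regrows 1250, so parameters move toward the layer with the highest mean momentum. Flooring can leave a few weights unassigned; how those are handled is not specified here.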

Additionally, there are two edge cases which we did not include in Algorithm 1 for clarity: (1) If we allocate more weights to be regrown than is possible for a specific layer, for example regrowing 100 weights for a layer with a maximum of 10 weights, we redistribute the excess number of weights equally among all other layers. (2) If a layer i is dense and still growing, we reduce the pruning rate p_i for this layer in proportion to its sparsity: p_i = min(p, sparsity_i).
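The two edge cases can be sketched as follows. This is one possible reading rather than the reference implementation: the helper names `layer_prune_rate` and `cap_regrowth` are hypothetical, and the loop that hands out excess weights in equal shares is an assumption about what "equally among all other layers" means when another layer also fills up.

```python
def layer_prune_rate(prune_rate, num_nonzero, num_total):
    """Edge case (2): p_i = min(p, sparsity_i)."""
    sparsity = 1.0 - num_nonzero / num_total
    return min(prune_rate, sparsity)


def cap_regrowth(requested, capacity):
    """Edge case (1): cap regrowth at each layer's free slots and spread
    the excess in equal shares over the layers that still have room."""
    granted = [min(r, c) for r, c in zip(requested, capacity)]
    excess = sum(requested) - sum(granted)
    while excess > 0:
        open_layers = [i for i, (g, c) in enumerate(zip(granted, capacity)) if g < c]
        if not open_layers:
            break  # no free slots left anywhere in the network
        share = max(1, excess // len(open_layers))
        for i in open_layers:
            add = min(share, capacity[i] - granted[i], excess)
            granted[i] += add
            excess -= add
            if excess == 0:
                break
    return granted


# A nearly dense layer (10% sparsity) is pruned at 10% instead of 50%.
print(layer_prune_rate(0.5, num_nonzero=90, num_total=100))

# The first layer is asked to regrow 100 weights but has room for 10;
# the 90 excess weights are split between the other two layers.
print(cap_regrowth(requested=[100, 20, 20], capacity=[10, 200, 200]))  # [10, 65, 65]
```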

After each epoch, we decay the pruning rate in Algorithm 1 in the same way learning rates are decayed. We find that a cosine decay schedule that anneals the pruning rate to zero on the last epoch yields the best validation error and we use this procedure for all experiments.
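A cosine schedule that reaches zero on the last epoch could look like the sketch below. The exact formula is an assumption (the standard half-cosine used for learning-rate annealing), and `cosine_prune_rate` is a hypothetical name.

```python
import math


def cosine_prune_rate(initial_rate, epoch, total_epochs):
    """Anneal the pruning rate from initial_rate (epoch 0) to 0 (last epoch).

    Epochs are counted from 0 to total_epochs - 1; total_epochs must be > 1.
    """
    progress = epoch / (total_epochs - 1)
    return 0.5 * initial_rate * (1.0 + math.cos(math.pi * progress))


# With p = 0.5 and 100 epochs the rate starts at 0.5 and ends at 0.
for epoch in (0, 25, 50, 75, 99):
    print(epoch, round(cosine_prune_rate(0.5, epoch, 100), 4))
```

Because the rate is close to zero in the final epochs, the sparse structure changes very little at the end of training under this schedule.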

Data: Layer i to k with: Momentum 𝐌_i, Weight 𝐖_i, binary Mask_i; pruning rate p
TotalMomentum ← 0, TotalNonzero ← 0
/* (a) Calculate mean momentum contributions of all layers. */
for i ← 0 to k do
    MeanMomentum_i ← mean(abs(𝐌_i[𝐖_i ≠ 0]))
    TotalMomentum ← TotalMomentum + MeanMomentum_i
    NonZero_i = sum(𝐖_i ≠ 0)
    TotalNonzero ← TotalNonzero + NonZero_i
end
for i ← 0 to k do
    LayerContribution_i ← MeanMomentum_i / TotalMomentum
    p_i ← getPruneRate(𝐖_i, p)
    NumRegrowth_i ← floor(p_i · TotalNonzero · LayerContribution_i)
end
/* (b) Prune weights by finding the NumRemove-th smallest weight. */
for i ← 0 to k do
    NumRemove_i ← NonZero_i · p
    PruneThreshold ← sort(abs(𝐖_i[𝐖_i ≠ 0]))[NumRemove_i]
    𝐌𝐚𝐬𝐤_i[𝐖_i < PruneThreshold] ← 0    // Stop gradient flow.
    𝐖_i[𝐖_i < PruneThreshold] ← 0
end
/* (c) Enable gradient flow of weights with largest momentum magnitude. */
for i ← 0 to k do
    RegrowthThreshold_i ← sort(abs(𝐌_i[𝐖_i == 0]))[NumRegrowth_i]
    𝐙_i = 𝐌_i · (𝐖_i == 0)    // Only consider the momentum of missing weights.
    𝐌𝐚𝐬𝐤_i ← 𝐌𝐚𝐬𝐤_i | (𝐙_i > RegrowthThreshold_i)    // | is the boolean OR operator
end
p ← decayPruneRate(p)
applyMask()

Algorithm 1: Sparse momentum algorithm in NumPy notation.
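Steps (b) and (c) for a single layer can be written as a small runnable sketch. It uses plain Python lists in place of tensors and makes two interpretive choices that are assumptions rather than statements about the reference implementation: pruning ranks weights by absolute value, and regrowth picks the zero-valued positions with the largest momentum magnitude by sorting in descending order. The function name `prune_and_regrow` and the example values are hypothetical.

```python
def prune_and_regrow(weights, momentum, prune_rate, num_regrow):
    """One prune/regrow step for a single flattened layer.

    weights:  list of floats, where 0.0 marks a missing weight
    momentum: list of floats with the same length as weights
    Returns (new_weights, mask); mask[i] is True if weight i may receive gradients.
    """
    new_weights = list(weights)
    mask = [w != 0.0 for w in weights]

    # (b) Prune the fraction prune_rate of nonzero weights with smallest magnitude.
    active = [i for i, w in enumerate(weights) if w != 0.0]
    num_remove = int(len(active) * prune_rate)
    for i in sorted(active, key=lambda i: abs(weights[i]))[:num_remove]:
        new_weights[i] = 0.0
        mask[i] = False  # stop gradient flow

    # (c) Among zero-valued weights, enable those with the largest |momentum|.
    missing = [i for i, w in enumerate(new_weights) if w == 0.0]
    ranked = sorted(missing, key=lambda i: abs(momentum[i]), reverse=True)
    for i in ranked[:num_regrow]:
        mask[i] = True  # the weight stays 0.0 but is trainable again

    return new_weights, mask


weights = [0.5, -0.05, 0.0, 0.0, 0.3, -0.8, 0.0, 0.0]
momentum = [0.1, 0.02, 0.4, -0.9, 0.2, 0.1, 0.05, 0.01]
new_weights, mask = prune_and_regrow(weights, momentum, prune_rate=0.5, num_regrow=2)
print(new_weights)  # [0.5, 0.0, 0.0, 0.0, 0.0, -0.8, 0.0, 0.0]
print(mask)         # [True, False, True, True, False, True, False, False]
```

In the example, the two smallest-magnitude weights (indices 1 and 4) are pruned and the two missing positions with the largest momentum magnitude (indices 3 and 2) are enabled, so the number of trainable weights stays at four.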

3.3 Experimental Setup

For comparison, we follow two different experimental settings from Lee et al. (2019) and Mostafa and Wang (2019): For MNIST (LeCun, 1998), we use a batch size of 100 and decay the learning rate by a factor of 0.1 every 25000 mini-batches. For CIFAR-10 (Krizhevsky and Hinton, 2009), we use standard data augmentations (horizontal flip, and random crop with reflective padding), a batch size of 128, and decay the learning rate every 30000 mini-batches. We train for 100 and 250 epochs on MNIST and CIFAR-10, respectively, and use a learning rate of 0.1, stochastic gradient descent with Nesterov momentum of 0.9, and a weight decay of 0.0005. We use a fixed 10% of the training data as the validation set and train on the remaining 90%. We evaluate the test set performance of our models on the last epoch. For all experiments on MNIST and CIFAR-10, we report the standard errors. Our sample size is generally between 10 and 12 experiments per method/architecture/sparsity level with different random seeds for each experiment.

We use the modified network architectures of AlexNet, VGG16, and LeNet-5 as introduced by Lee et al. (2019). For the setup of Mostafa and Wang (2019) we use no validation set and for Wide Residual Networks (WRN) 28-2 (Zagoruyko and Komodakis, 2016) experiments on CIFAR-10 we start with the following layers as dense: First convolutional layer, last fully connected layer, and all downsample residual convolutional layers.

On ImageNet (Deng et al., 2009), we use ResNet-50 (He et al., 2016) with a stride of 2 for the 3x3 convolution in the bottleneck layers. We use a batch size of 256, input size of 224, momentum of 0.9, and weight decay of 10^-4. We train for 100 epochs and report validation set performance after the last epoch.

For all experiments, we keep biases and batch normalization weights dense. We additionally tune a single parameter: The initial pruning rate p. We search in the space {0.2, 0.3, 0.4, 0.5, 0.6, 0.7} and find that for most networks on MNIST and CIFAR-10 a pruning rate of p=0.5 works best. We use this pruning rate throughout all experiments.

ImageNet experiments were run on 4x RTX 2080 Ti and all other experiments on individual GPUs.

Our software builds on PyTorch (Paszke et al., 2017) and is a wrapper for PyTorch neural networks with a modular architecture for growth, redistribution, and pruning algorithms. Using our software, any PyTorch neural network can be adapted to be a sparse momentum network with 5 lines of code. We will open-source our software along with trained models and individual experimental results.

Figure 2: Test set accuracy with 95% confidence intervals on MNIST and CIFAR at varying sparsity levels for LeNet 300-100 and WRN 28-2.

4 Results

Results in Table 1 and Table 2 follow the procedure of Lee et al. (2019). On MNIST, sparse momentum does very well for the LeNet-5 Caffe model, achieving equal performance to the dense baseline with 20% of the weights. For LeNet 300-100, sparse momentum outperforms baselines when using a moderate amount of weights and, with 20% of the weights, exceeds dense baseline performance. However, for 1-2% of weights, variational dropout is more effective.

On CIFAR-10 in Table 2, we can see that sparse momentum outperforms Single-shot Network Pruning (SNIP) for all models and can achieve the same performance level as dense models for VGG16-D and WRN 16-10 with just 5% of weights.

Figure 2 shows the results on MNIST and CIFAR that follow the experimental procedure of Mostafa and Wang (2019). For LeNet 300-100 on MNIST, we can see that sparse momentum outperforms all other methods. For CIFAR-10, sparse momentum is better than dynamic sparse in 4 out of 5 cases. However, in general, the confidence intervals for most methods overlap – this particular setup for CIFAR-10 with specifically selected dense weights seems to be too easy to differentiate performance between methods and we do not recommend this setup for future work. Sparse momentum outperforms all other methods on ImageNet (ILSVRC2012) as shown in Table 3.
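The asterisks in Table 1 mark sparse results whose 95% confidence intervals overlap with or exceed the dense model. The sketch below shows how such a check can be made from a mean and its standard error, using the 20%-weight LeNet entries from Table 1. It assumes a normal approximation (mean ± 1.96 standard errors); how the intervals were actually computed is not stated, and the helper names are hypothetical.

```python
def confidence_interval(mean, standard_error, z=1.96):
    """Normal-approximation 95% confidence interval (assumed, see lead-in)."""
    half_width = z * standard_error
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Test error (%) and standard error from Table 1.
dense_300 = confidence_interval(1.34, 0.011)   # LeNet 300-100, dense
sparse_300 = confidence_interval(1.26, 0.017)  # LeNet 300-100, 20% weights
dense_5 = confidence_interval(0.58, 0.010)     # LeNet-5 Caffe, dense
sparse_5 = confidence_interval(0.60, 0.013)    # LeNet-5 Caffe, 20% weights

# LeNet 300-100: no overlap, and the sparse interval lies below the dense one.
print(intervals_overlap(dense_300, sparse_300), sparse_300[1] < dense_300[0])
# LeNet-5 Caffe: the intervals overlap.
print(intervals_overlap(dense_5, sparse_5))
```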

Table 1: MNIST test set performance (± standard error). W indicates the density (%) of the weights.

                                                LeNet 300-100          LeNet-5 Caffe
Method                                          W (%)   Error (%)      W (%)   Error (%)
Dense                                           100.0   1.34±0.011     100.0   0.58±0.010
Opt. Brain Damage (LeCun et al., 1989)          8.0     2.0            8.0     2.7
Layer-wise Brain Damage (Dong et al., 2017)     1.5     2.0            1.0     2.1
Compression via optimization**                  1.0     3.2            1.0     1.1
Single-shot Net. Pruning (Lee et al., 2019)     2.0     2.4            1.0     1.1
Soft weight-sharing (Ullrich et al., 2017)      4.4     1.9            0.5     1.0
Dyn. Network Surgery (Guo et al., 2016)         1.8     2.0            0.9     0.9
Learn weights&connections (Han et al., 2015)    8.3     1.6            9.3     0.8
Single-shot Net. Pruning (Lee et al., 2019)     5.0     1.6            2.0     0.8
Variational Dropout (Molchanov et al., 2017)    1.5     1.9            0.4     0.8
Sparse Momentum                                 1.0     2.36±0.044     1.0     0.83±0.040
Sparse Momentum                                 2.0     1.99±0.019     2.0     0.76±0.022
Sparse Momentum                                 5.0     1.53±0.020     5.0     0.69±0.021
Sparse Momentum                                 20.0    1.26±0.017*    20.0    0.60±0.013*

* 95% confidence intervals overlap with or exceed dense model.
** (Carreira-Perpinán and Idelbayev, 2018).

Table 2: CIFAR-10 test set error (± standard error) for dense baselines, Sparse Momentum, and SNIP.

Model        Dense error (%)   SNIP error (%)   Sparse Momentum error (%)   Weights (%)
AlexNet-s    12.95±0.056       14.99            14.35±0.057                 10
AlexNet-b    12.85±0.068       14.50            13.93±0.048                 10
VGG16-C      6.49±0.038        7.27             6.77±0.056                  5
VGG16-D      6.59±0.050        7.09             6.49±0.045*                 5
VGG16-like   6.50±0.054        8.00             6.71±0.046                  3
WRN-16-8     4.57±0.022        6.63             5.66±0.054                  5
WRN-16-10    4.45±0.040        6.43             4.59±0.043*                 5
WRN-22-8     4.26±0.032        5.85             4.96±0.042                  5

* 95% confidence intervals overlap with dense model.

Table 3: ImageNet results for sparse momentum. Other results are from Mostafa and Wang (2019). All values are accuracy (%).

                                                     10% weights       20% weights
Model                                                Top-1   Top-5     Top-1   Top-5
Dense baseline (He et al., 2016)                     79.3    94.8      79.3    94.8
Static sparse (Mostafa and Wang, 2019)               67.8    88.4      71.6    90.4
Thin Dense (Mostafa and Wang, 2019)                  70.7    89.9      72.4    90.9
DeepR (Bellec et al., 2018)                          70.2    90.0      71.7    90.6
Compressed sparse (Mostafa and Wang, 2019)           70.3    90.0      73.2    91.5
Sparse Evolutionary Training (Mocanu et al., 2018)   70.4    90.1      72.6    91.2
Dynamic Sparse (Mostafa and Wang, 2019)              71.6    90.5      73.3    92.4
Sparse momentum                                      73.1    91.5      74.9    92.5

4.1 Speedups and Overhead

We estimated the speedups that could be obtained using sparse momentum in two ways: theoretical speedups for sparse convolution algorithms and practical speedups using dense convolution algorithms. For our sparse convolution estimates, we first benchmark the time taken for each dense convolutional layer for a training run and scale it by the sparsity to estimate the speedups gained (equivalent to FLOPs saved). This reflects the maximum speedup for our sparse networks, which can be obtained if optimized sparse convolution algorithms are used. While a fast sparse convolution algorithm for coarse block structures exists for GPUs (Gray et al., 2017), optimal sparse convolution algorithms for fine-grained patterns do not and need to be developed to enable these speedups.
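The sparse-convolution estimate amounts to weighting each layer's measured dense run-time by the fraction of weights that remain. A minimal sketch, with hypothetical timings and densities and a hypothetical function name:

```python
def sparse_convolution_speedup(layer_times, layer_densities):
    """Estimated speedup if run-time scaled linearly with the remaining weights.

    layer_times[i]:     measured dense run-time of layer i (any consistent unit)
    layer_densities[i]: fraction of nonzero weights in layer i (0 < density <= 1)
    """
    dense_time = sum(layer_times)
    sparse_time = sum(t * d for t, d in zip(layer_times, layer_densities))
    return dense_time / sparse_time


# Hypothetical three-layer network: run-times in milliseconds and densities.
speedup = sparse_convolution_speedup(
    layer_times=[4.0, 2.0, 2.0],
    layer_densities=[0.5, 0.25, 0.125],
)
print(round(speedup, 2))  # 2.91
```

Because the estimate is a time-weighted average, a layer that dominates the run-time and stays relatively dense limits the overall speedup even when the network as a whole is very sparse.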

The second method measures practical speedups that can be obtained with naive, dense convolution algorithms which are available today. For dense convolution algorithms, we estimate speedups as follows: If a convolutional channel contains only zero-valued weights, we can remove this channel from the computation without any consequences and obtain speedups. We assume a linear speedup with an increasing number of empty convolutional channels. We use an RTX Titan and measure the run-time of a dense convolution in 32-bit. We then scale these measurements by the proportion of empty convolutional channels. Using this measure, we estimated the speedups for our models on CIFAR-10. The resulting speedups can be seen in Table 4. We see that dense convolution speedups are mostly dependent on width, with wider networks receiving larger speedups. Sparse convolution speedups are particularly pronounced for Wide Residual Networks (WRN). These results highlight the importance of developing optimized algorithms for sparse convolution.
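The dense-convolution estimate can be sketched in the same style: count the channels whose weights are all zero and scale each layer's run-time by the share of channels that remain. Kernels are represented here as nested lists (one flat list of weights per channel); the function names and all numbers are hypothetical, and the linear scaling is the assumption stated above.

```python
def empty_channel_fraction(kernel):
    """Fraction of channels in a layer whose weights are all zero."""
    empty = sum(1 for channel in kernel if all(w == 0.0 for w in channel))
    return empty / len(kernel)


def dense_convolution_speedup(layer_times, kernels):
    """Estimated speedup when empty channels are skipped (linear scaling).

    Assumes at least one non-empty channel remains in the network.
    """
    dense_time = sum(layer_times)
    reduced_time = sum(
        t * (1.0 - empty_channel_fraction(k)) for t, k in zip(layer_times, kernels)
    )
    return dense_time / reduced_time


# Hypothetical layers: one of four channels is empty in the first layer,
# one of two in the second.
kernels = [
    [[0.3, 0.0], [0.0, 0.0], [0.0, -0.2], [0.1, 0.4]],
    [[0.0, 0.0, 0.0], [0.7, 0.0, -0.1]],
]
speedup = dense_convolution_speedup(layer_times=[2.0, 2.0], kernels=kernels)
print(speedup)  # 1.6
```

A channel with even a single nonzero weight must still be computed, which is why random regrowth, which scatters weights over all channels, removes this source of speedup.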

Beyond speedups, we also measured the overhead of our sparse momentum procedure, which is equivalent to a slowdown to 0.973x±0.029x compared to a dense baseline.

Table 4: Speedups for sparse networks on CIFAR-10 compared to dense baselines.

Model        Dense convolution speedup   Sparse convolution speedup   Weights (%)
AlexNet-s    1.45x                       4.00x                        10
VGG16-D      1.36x                       3.51x                        5
WRN 28-2     1.19x                       5.82x                        5
WRN 16-10    1.16x                       11.85x                       5

5 Analysis

5.1 Ablation Analysis

Our method differs from previous methods like Sparse Evolutionary Training and Dynamic Sparse Reparameterization in two ways: (1) redistribution of weights and (2) growth of weights. To better understand how these components contribute to the overall performance, we ablate these components on CIFAR-10 for VGG16-D and MNIST for LeNet 300-100 and LeNet-5 Caffe with 5% weights for all experiments. The results can be seen in Table 5.

Redistribution according to the magnitude of momentum increases the performance the most for the deeper networks VGG16-D and LeNet-5 Caffe. We hypothesize that the benefit of redistribution algorithms is proportional to the depth of a network: the deeper a network is, the more it relies on learning a hierarchy of features across layers – redistribution facilitates the learning of hierarchies by moving parameters from shallow layers to deeper layers as training progresses.

Momentum growth increases performance for LeNet 300-100 reliably. There is some evidence that random growth improves performance slightly for VGG16-D and LeNet-5 Caffe, but the confidence intervals overlap, and this observation might be a statistical anomaly. Furthermore, the use of random growth distributes parameters across all convolutional channels, and thus it is no longer possible to achieve speedups with dense convolutional algorithms – this is contrary to the main goal of our work. If one is interested in predictive performance, it is more reasonable to increase the number of parameters and use momentum growth, which would yield both better performance and provide speedups compared to random growth.

Table 5: Ablation analysis for different growth and redistribution algorithm combinations. All values are test error (%).

Redistribution   Growth     VGG16-D        LeNet 300-100   LeNet-5 Caffe
momentum         momentum   6.49±0.045     1.53±0.020      0.69±0.021
momentum         random     -0.15±0.054    +0.07±0.022     -0.05±0.011
None             momentum   +0.79±0.082    +0.01±0.018     +0.32±0.071
None             random     +0.49±0.060    +0.11±0.020     +0.13±0.013

Figure 3: Dense vs sparse histograms of class-specialization for convolutional channels on CIFAR-10. A class-specialization of 0.5 indicates that 50% of the overall activity comes from a single class.

5.2 Dense vs Sparse Features

Sparse networks need to use every weight effectively to build feature representations which are competitive with dense networks. In this section, we study the difference between sparse and dense features to further our understanding of what features look like that enable sparse learning.

For feature visualization, it is common to backpropagate activity to the inputs to be able to visualize what these activities represent (Simonyan et al., 2013; Zeiler and Fergus, 2014; Springenberg et al., 2014). However, in our case, we are more interested in the overall distribution of features for each layer within our network, and as such we want to look at the magnitude of the activity in a channel since – unlike feature visualization – we are not just interested in feature detectors but also discriminators. For example, a face detector would induce positive activity for a ‘person’ class but might produce negative activity for a ‘mushroom’ class. Both kinds of activity are useful.

With this reasoning, we develop the following convolutional channel-activation analysis: (1) pass the entire training set through the network and aggregate the magnitude of the activation in each convolutional channel separately for each class; (2) normalize across classes to obtain, for each channel, the proportion of activation which is due to each class; (3) look at the maximum proportion of each channel as a measure of class specialization: a maximum proportion of 1/Nc, where Nc is the number of classes, indicates that the channel is equally active for all classes in the training set. The further the proportion deviates from this value, the more specialized a channel is for a particular class.
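Steps (2) and (3) of this analysis reduce to a normalization followed by a maximum. The sketch below assumes step (1) has already produced a table of summed activation magnitudes per class and channel; the function name `class_specialization` and the numbers are hypothetical.

```python
def class_specialization(activation_by_class):
    """Return the class specialization of every channel.

    activation_by_class[c][k]: summed |activation| of channel k over all
    training examples of class c. The result for channel k is the largest
    share of that channel's total activation that comes from a single class.
    """
    num_channels = len(activation_by_class[0])
    specialization = []
    for k in range(num_channels):
        per_class = [row[k] for row in activation_by_class]
        total = sum(per_class)
        specialization.append(max(per_class) / total)
    return specialization


# Hypothetical aggregated activations for 4 classes (rows) and 2 channels (columns).
activations = [
    [1.0, 6.0],
    [1.0, 1.0],
    [1.0, 1.0],
    [1.0, 0.0],
]
print(class_specialization(activations))  # [0.25, 0.75]
```

The first channel is equally active for all four classes and scores 1/Nc = 0.25, while the second receives 75% of its activity from a single class.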

Results of this method can be seen for AlexNet-s, VGG16-D, and WRN 28-2 on CIFAR-10 in Figure 3. We see the convolutional channels in sparse networks have lower class-specialization indicating they learn features which are useful for a broader range of classes compared to dense networks. This trend intensifies with depth. This suggests that sparse networks might be able to rival dense networks by learning more general features.

6 Conclusion and Future Work

We presented our sparse learning algorithm, sparse momentum, which uses the mean magnitude of momentum to grow and redistribute weights. We showed that sparse momentum outperforms other sparse algorithms on MNIST, CIFAR-10, and ImageNet. Additionally, sparse momentum can rival dense neural network performance while yielding speedups. In our analysis, we showed that sparse networks might be able to rival dense networks by learning more general features compared to dense models. We believe that further study of sparse networks and their representations can inform the design of architectures and deep feature learning algorithms. To fully utilize the improved run-time performance of sparse learning algorithms, future research should focus on specialized sparse convolution and sparse matrix multiplication algorithms.

7 Acknowledgements

This work was funded by a Jeff Dean – Heidi Hopper Endowed Regental Fellowship. We thank Ofir Press, Jungo Kasai, Omer Levy, Sebastian Riedel and Yejin Choi for helpful discussions. We thank Ofir Press, Jungo Kasai, Judit Acs, Zoey Chen, Ethan Perez, and Mohit Shridhar for their helpful reviews and comments.


  • Bellec et al. (2018) Bellec, G., Kappel, D., Maass, W., and Legenstein, R. A. (2018). Deep rewiring: Training very sparse deep networks. CoRR, abs/1711.05136.
  • Carreira-Perpinán and Idelbayev (2018) Carreira-Perpinán, M. A. and Idelbayev, Y. (2018). “learning-compression” algorithms for neural net pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8532–8541.
  • Chauvin (1988) Chauvin, Y. (1988). A back-propagation algorithm with optimal use of hidden units. In NIPS.
  • Child et al. (2019) Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). Generating long sequences with sparse transformers. CoRR, abs/1904.10509.
  • Dai et al. (2017) Dai, X., Yin, H., and Jha, N. K. (2017). Nest: A neural network synthesis tool based on a grow-and-prune paradigm. CoRR, abs/1711.02017.
  • Dai et al. (2018) Dai, X., Yin, H., and Jha, N. K. (2018). Grow and prune compact, fast, and accurate lstms. CoRR, abs/1805.11797.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE.
  • Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
  • Dong et al. (2017) Dong, X., Chen, S., and Pan, S. J. (2017). Learning to prune deep neural networks via layer-wise optimal brain surgeon. In NIPS.
  • Frankle and Carbin (2019) Frankle, J. and Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR 2019.
  • Frankle et al. (2019) Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. (2019). The lottery ticket hypothesis at scale. CoRR, abs/1903.01611.
  • Gale et al. (2019) Gale, T., Elsen, E., and Hooker, S. (2019). The state of sparsity in deep neural networks. CoRR, abs/1902.09574.
  • Gray et al. (2017) Gray, S., Radford, A., and Kingma, D. P. (2017). Gpu kernels for block-sparse weights.
  • Guo et al. (2016) Guo, Y., Yao, A., and Chen, Y. (2016). Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pages 1379–1387.
  • Han et al. (2015) Han, S., Pool, J., Tran, J., and Dally, W. (2015). Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143.
  • Hassibi and Stork (1992) Hassibi, B. and Stork, D. G. (1992). Second order derivatives for network pruning: Optimal brain surgeon. In NIPS.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
  • Herculano-Houzel et al. (2010) Herculano-Houzel, S., Mota, B., Wong, P., and Kaas, J. H. (2010). Connectivity-driven white matter scaling and folding in primate cerebral cortex. Proceedings of the National Academy of Sciences of the United States of America, 107 44:19008–13.
  • Ishikawa (1996) Ishikawa, M. (1996). Structural learning with forgetting. Neural Networks, 9:509–521.
  • Karnin (1990) Karnin, E. D. (1990). A simple procedure for pruning back-propagation trained neural networks. IEEE transactions on neural networks, 1 2:239–42.
  • Krizhevsky and Hinton (2009) Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer.
  • LeCun (1998) LeCun, Y. (1998). Gradient-based learning applied to document recognition.
  • LeCun et al. (1989) LeCun, Y., Denker, J. S., and Solla, S. A. (1989). Optimal brain damage. In NIPS.
  • Lee et al. (2019) Lee, N., Ajanthan, T., and Torr, P. H. S. (2019). Snip: Single-shot network pruning based on connection sensitivity. In ICLR 2019.
  • Louizos et al. (2017) Louizos, C., Ullrich, K., and Welling, M. (2017). Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3288–3298.
  • Louizos et al. (2018) Louizos, C., Welling, M., and Kingma, D. P. (2018). Learning sparse neural networks through l0 regularization. CoRR, abs/1712.01312.
  • Mocanu et al. (2018) Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. (2018). Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 9(1):2383.
  • Molchanov et al. (2017) Molchanov, D., Ashukha, A., and Vetrov, D. P. (2017). Variational dropout sparsifies deep neural networks. In International Conference on MachineLearning (ICML).
  • Mostafa and Wang (2019) Mostafa, H. and Wang, X. (2019). Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning (ICML).
  • Mozer and Smolensky (1988) Mozer, M. C. and Smolensky, P. (1988). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In NIPS.
  • Narang et al. (2017) Narang, S., Diamos, G. F., Sengupta, S., and Elsen, E. (2017). Exploring sparsity in recurrent neural networks. CoRR, abs/1704.05119.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch.
  • Qian (1999) Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural networks : the official journal of the International Neural Network Society, 12 1:145–151.
  • Simonyan et al. (2013) Simonyan, K., Vedaldi, A., and Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034.
  • Springenberg et al. (2014) Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. A. (2014). Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806.
  • Ullrich et al. (2017) Ullrich, K., Meeds, E., and Welling, M. (2017). Soft weight-sharing for neural network compression. CoRR, abs/1702.04008.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Zagoruyko and Komodakis (2016) Zagoruyko, S. and Komodakis, N. (2016). Wide residual networks. ArXiv, abs/1605.07146.
  • Zeiler and Fergus (2014) Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In ECCV.
  • Zhou et al. (2019) Zhou, H., Lan, J., Liu, R., and Yosinski, J. (2019). Deconstructing lottery tickets: Zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067.
  • Zhu and Gupta (2018) Zhu, M. and Gupta, S. (2018). To prune, or not to prune: Exploring the efficacy of pruning for model compression. CoRR, abs/1710.01878.