Abstract
The training of sparse neural networks is becoming an increasingly important tool for reducing the computational footprint of models at training and evaluation, as well as enabling the effective scaling up of models. Whereas much work over the years has been dedicated to specialised pruning techniques, little attention has been paid to the inherent effect of gradient-based training on model sparsity. In this work, we introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models. Exploiting the behaviour of gradient descent, our method gives rise to weight updates exhibiting a "rich get richer" dynamic, leaving low-magnitude parameters largely unaffected by learning. Models trained in this manner exhibit similar performance, but have a weight distribution with markedly higher density at zero, allowing more parameters to be pruned safely. Powerpropagation is general, intuitive, cheap and straightforward to implement, and can readily be combined with various other techniques. To highlight its versatility, we explore it in two very different settings. Firstly, following a recent line of work, we investigate its effect on sparse training for resource-constrained settings. Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark. Secondly, we advocate the use of sparsity in overcoming catastrophic forgetting, where compressed representations allow accommodating a large number of tasks at fixed model capacity. In all cases our reparameterisation considerably increases the efficacy of the off-the-shelf methods.
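The "rich get richer" dynamic can be illustrated with a minimal sketch. The abstract does not give the form of the reparameterisation, so the code assumes a sign-preserving power of the underlying parameter, w = phi * |phi|^(alpha - 1) with alpha >= 1. The function names, the value alpha = 2 and the toy gradients are hypothetical choices made for illustration only. Under this assumption the chain rule scales each parameter's gradient by alpha * |phi|^(alpha - 1), so parameters near zero receive proportionally smaller updates.

```python
# Illustrative sketch only: the power form below is an assumption, not a
# definition taken from the abstract. All names here are hypothetical.

def effective_weight(phi, alpha):
    """Map an underlying parameter to the weight used in the forward pass.

    Assumed form: w = phi * |phi|**(alpha - 1), which keeps the sign of phi.
    With alpha = 1 this reduces to the standard parameterisation w = phi.
    """
    return phi * abs(phi) ** (alpha - 1)


def grad_wrt_phi(grad_w, phi, alpha):
    """Chain rule: dL/dphi = dL/dw * dw/dphi = dL/dw * alpha * |phi|**(alpha - 1)."""
    return grad_w * alpha * abs(phi) ** (alpha - 1)


def sgd_step(phis, grads_w, alpha, lr):
    """One plain gradient-descent step on the underlying parameters."""
    return [p - lr * grad_wrt_phi(g, p, alpha) for p, g in zip(phis, grads_w)]


# Two parameters, one small and one large, given the same upstream gradient.
phis = [0.01, 1.0]
grads_w = [1.0, 1.0]

new_phis = sgd_step(phis, grads_w, alpha=2.0, lr=0.1)
step_sizes = [abs(n - p) for n, p in zip(new_phis, phis)]

# Baseline with alpha = 1 (standard parameterisation) for comparison.
baseline = sgd_step(phis, grads_w, alpha=1.0, lr=0.1)
baseline_steps = [abs(n - p) for n, p in zip(baseline, phis)]

print("power parameterisation steps:", step_sizes)
print("standard parameterisation steps:", baseline_steps)
```

In this toy setting the standard parameterisation moves both parameters by the same amount, whereas the assumed power form moves the larger parameter far more than the smaller one. This is the kind of behaviour that would leave low-magnitude parameters close to zero and therefore easier to prune.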