Good Students Play Big Lottery Better

Abstract

The lottery ticket hypothesis suggests that a dense neural network contains a sparse sub-network that can match the test accuracy of the original dense net when trained in isolation from the same random initialization. However, the hypothesis failed to generalize to larger dense networks such as ResNet-50. As a remedy, recent studies demonstrate that a sparse sub-network can still be obtained by using a rewinding technique, which re-trains it from the early-phase training weights or learning rates of the dense model, rather than from random initialization.

Is rewinding the only or the best way to scale up lottery tickets? This paper proposes a new, simpler, yet powerful technique for re-training the sub-network, called the "Knowledge Distillation ticket" (KD ticket). Rewinding exploits the value of inheriting knowledge from the early training phase to improve lottery tickets in large networks. In comparison, KD ticket addresses a complementary possibility: inheriting useful knowledge from the late training phase of the dense model. It does so by re-training the sub-network on the soft labels generated by the trained dense model instead of the hard labels. Extensive experiments are conducted using several large deep networks (e.g., ResNet-50 and ResNet-110) on the CIFAR-10 and ImageNet datasets. Without bells and whistles, when applied by itself, KD ticket performs on par with or better than rewinding, while being nearly free of hyperparameters or ad-hoc selection. KD ticket can also be applied together with rewinding, yielding state-of-the-art results for large-scale lottery tickets.
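To make the contrast between hard-label and soft-label re-training concrete, the following is a minimal sketch in plain Python, not the paper's implementation. The function names (`softmax`, `hard_label_loss`, `soft_label_loss`) and the `temperature` parameter are illustrative assumptions; the abstract does not specify the exact loss form. The sketch assumes the soft labels are the dense model's softmax outputs and that the sub-network is trained with cross-entropy against them.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities (temperature is an assumed knob)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_label_loss(student_logits, label):
    """Standard cross-entropy against a one-hot (hard) label."""
    return -math.log(softmax(student_logits)[label])


def soft_label_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between the dense model's soft labels and the
    sub-network's predictions (assumed form of the distillation loss)."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))


# Hypothetical logits for one 3-class example.
dense_logits = [2.0, 1.0, 0.1]    # trained dense model (teacher)
sparse_logits = [0.1, 1.0, 2.0]   # pruned sub-network (student)

print(hard_label_loss(sparse_logits, label=0))
print(soft_label_loss(sparse_logits, dense_logits))
```

In a real training loop the soft-label loss would replace (or be mixed with) the hard-label loss while the pruning mask keeps the removed weights at zero; how the two are weighted is a design choice that this sketch leaves open.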
