Abstract
Knowledge distillation, introduced in the deep learning context, is a method to transfer knowledge from one architecture to another. In particular, when the architectures are identical, this is called self-distillation. The idea is to feed in predictions of the trained model as new target values for retraining (and possibly iterate this loop a few times). It has been empirically observed that the self-distilled model often achieves higher accuracy on held-out data. Why this happens, however, has been a mystery: the self-distillation dynamics receive no new information about the task and evolve solely by looping over training. To the best of our knowledge, there is no rigorous understanding of why this happens. This work provides the first theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is a Hilbert space and fitting is subject to L2 regularization in this function space. We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance.
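As a minimal illustration of the loop described above, the following sketch runs self-distillation with kernel ridge regression in NumPy. It is a simplified stand-in rather than the exact setting analyzed in this work: the RBF kernel, the synthetic sine data, the fixed regularization weight `LAM`, and the 0.1 threshold used to count "active" basis functions are all assumptions made for illustration. With a fixed weight, each round multiplies the component of the fit along the k-th kernel eigenvector by d_k / (d_k + LAM), so components with small eigenvalues d_k vanish first.

```python
import numpy as np

LAM = 0.1     # assumed fixed L2 regularization weight (illustrative)
ROUNDS = 5    # number of self-distillation rounds (illustrative)


def rbf_kernel(x, z, length_scale=0.3):
    """Gram matrix of a Gaussian (RBF) kernel on 1-D inputs."""
    sq_dist = (x[:, None] - z[None, :]) ** 2
    return np.exp(-sq_dist / (2.0 * length_scale ** 2))


def self_distill(K, y, lam, rounds):
    """Repeatedly refit kernel ridge regression on its own predictions.

    Returns the training-point predictions after each round.
    """
    n = K.shape[0]
    targets = y.copy()
    history = []
    for _ in range(rounds):
        # Fit: solve (K + lam * I) alpha = current targets.
        alpha = np.linalg.solve(K + lam * np.eye(n), targets)
        # Predictions on the training inputs become the next targets.
        targets = K @ alpha
        history.append(targets.copy())
    return history


# Synthetic noisy data (an assumption, not taken from the paper).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2.0 * np.pi * x) + 0.3 * rng.standard_normal(x.size)

K = rbf_kernel(x, x)
history = self_distill(K, y, LAM, ROUNDS)

# Spectral view: round t scales eigen-component k by shrink[k] ** t.
eigvals, eigvecs = np.linalg.eigh(K)
shrink = eigvals / (eigvals + LAM)
closed_form = eigvecs @ (shrink ** ROUNDS * (eigvecs.T @ y))

# Count eigen-directions whose cumulative factor stays above 0.1.
active = [int(np.sum(shrink ** t > 0.1)) for t in range(1, ROUNDS + 1)]
norms = [float(np.linalg.norm(h)) for h in history]

print("active basis functions per round:", active)
print("prediction norm per round:", np.round(norms, 4))
```

The count in `active` can only stay the same or shrink from one round to the next, which mirrors the claim that further rounds restrict the usable basis functions and can eventually cause under-fitting.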