Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation

  • 2021-02-25 18:56:09
  • Kenneth Borup, Lars N. Andersen

Abstract

Knowledge distillation is classically a procedure where a neural network is trained on the output of another network along with the original targets in order to transfer knowledge between the architectures. The special case of self-distillation, where the network architectures are identical, has been observed to improve generalization accuracy. In this paper, we consider an iterative variant of self-distillation in a kernel regression setting, in which successive steps incorporate both model outputs and the ground-truth targets. This allows us to provide the first theoretical results on the importance of using the weighted ground-truth targets in self-distillation. Our focus is on fitting nonlinear functions to training data with a weighted mean square error objective function suitable for distillation, subject to $\ell_2$ regularization of the model parameters. We show that any such function obtained with self-distillation can be calculated directly as a function of the initial fit, and that infinitely many distillation steps yield the same optimization problem as the original with amplified regularization. Finally, we examine empirically, both in a regression setting and with ResNet networks, how the choice of weighting parameter influences the generalization performance after self-distillation.
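
The iterative procedure described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes kernel ridge regression with an RBF kernel on synthetic one-dimensional data. The names `alpha` (weight on the ground-truth targets), `lam` (the $\ell_2$ penalty), `gamma`, and the helper functions are all hypothetical choices made for illustration. The sketch relies on the fact that minimizing a weighted squared error against two target vectors, plus a ridge penalty, gives the same fit as ridge regression on the weighted average of the two targets.

```python
import numpy as np

def rbf_kernel(a, b, gamma=10.0):
    # Gaussian (RBF) kernel matrix between two 1-D point sets.
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def self_distill(K, y, alpha, lam, steps):
    """Return the training-set fits after 0, 1, ..., `steps` distillation steps.

    Each step refits kernel ridge regression on the blended targets
    alpha * y + (1 - alpha) * previous_fit (an assumed form of the
    weighted objective; alpha = 1 ignores the previous model entirely).
    """
    n = K.shape[0]
    A = K + lam * np.eye(n)
    fits = [K @ np.linalg.solve(A, y)]          # initial fit on ground truth
    for _ in range(steps):
        targets = alpha * y + (1.0 - alpha) * fits[-1]
        fits.append(K @ np.linalg.solve(A, targets))
    return fits

# Synthetic data (assumption): noisy sine wave on [0, 1].
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)
K = rbf_kernel(x, x)

fits_gt = self_distill(K, y, alpha=1.0, lam=0.1, steps=5)    # ground truth only
fits_mix = self_distill(K, y, alpha=0.5, lam=0.1, steps=5)   # blended targets
fits_pure = self_distill(K, y, alpha=0.0, lam=0.1, steps=5)  # model outputs only

for name, fits in [("alpha=1.0", fits_gt), ("alpha=0.5", fits_mix), ("alpha=0.0", fits_pure)]:
    print(name, [round(float(np.linalg.norm(f)), 4) for f in fits])
```

In this toy setting, printing the norms of the successive fits shows how a smaller `alpha` lets each step shrink the fit further, while keeping weight on the ground-truth targets limits that shrinkage. This is consistent with the dampening effect named in the title, though the sketch makes no claim about the paper's exact formulation or results.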

 
