On the Efficacy of Knowledge Distillation

  • 2019-10-03 08:14:13
  • Jang Hyun Cho, Bharath Hariharan

Abstract

In this paper, we present a thorough evaluation of the efficacy of knowledge distillation and its dependence on student and teacher architectures. Starting with the observation that more accurate teachers often don't make good teachers, we attempt to tease apart the factors that affect knowledge distillation performance. We find crucially that larger models do not often make better teachers. We show that this is a consequence of mismatched capacity, and that small students are unable to mimic large teachers. We find typical ways of circumventing this (such as performing a sequence of knowledge distillation steps) to be ineffective. Finally, we show that this effect can be mitigated by stopping the teacher's training early. Our results generalize across datasets and models.
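
The abstract does not spell out the distillation objective itself. As a rough illustration only, the sketch below assumes the commonly used temperature-softened formulation, in which the student is trained on a weighted sum of a KL term against the teacher's softened outputs and an ordinary cross-entropy term against the true label. The function names, the `temperature` and `alpha` parameters, and their default values are illustrative assumptions, not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.9):
    """Hypothetical helper: weighted sum of a soft (teacher) and a hard (label) term.

    temperature and alpha are assumed hyperparameters, not values from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The T^2 factor is the usual rescaling that keeps the soft term's
    # gradient magnitude comparable across temperatures.
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    # Standard cross-entropy of the student against the ground-truth label.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1 - alpha) * hard
```

Under this formulation, a student whose logits match the teacher's exactly has a soft term of zero. The capacity mismatch described in the abstract would correspond to a small student that cannot drive this term down against a large teacher.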

 
