Knowledge Distillation Beyond Model Compression

Abstract

Knowledge distillation (KD) is commonly regarded as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher). Various techniques have been proposed since the original formulation, each mimicking a different aspect of the teacher, such as the representation space, decision boundary, or intra-data relationship. Some methods replace the one-way knowledge distillation from a static teacher with collaborative learning between a cohort of students. Despite the recent advances, a clear understanding of where knowledge resides in a deep neural network, and an optimal method for capturing knowledge from the teacher and transferring it to the student, remain open questions. In this work, we present an extensive study of nine KD methods that cover a broad spectrum of approaches to capturing and transferring knowledge. We demonstrate the versatility of the KD framework on different datasets and network architectures under varying capacity gaps between the teacher and student. The study provides intuition for the effects of mimicking different aspects of the teacher and derives insights from the performance of the different distillation approaches to guide the design of more effective KD methods. Furthermore, our study shows the effectiveness of the KD framework in learning efficiently under varying severity levels of label noise and class imbalance, consistently providing generalization gains over standard training. We emphasize that the efficacy of KD goes well beyond model compression, and that it should be considered a general-purpose training paradigm that offers more robustness to common challenges in real-world datasets than the standard training procedure.
