Knowledge distillation: A good teacher is patient and consistent

Abstract

There is a growing discrepancy in computer vision between large-scale models that achieve state-of-the-art performance and models that are affordable in practical applications. In this paper we address this issue and significantly bridge the gap between these two types of models. Throughout our empirical investigation we do not aim to necessarily propose a new method, but strive to identify a robust and effective recipe for making state-of-the-art large-scale models affordable in practice. We demonstrate that, when performed correctly, knowledge distillation can be a powerful tool for reducing the size of large models without compromising their performance. In particular, we uncover that there are certain implicit design choices, which may drastically affect the effectiveness of distillation. Our key contribution is the explicit identification of these design choices, which were not previously articulated in the literature. We back up our findings by a comprehensive empirical study, demonstrate compelling results on a wide range of vision datasets and, in particular, obtain a state-of-the-art ResNet-50 model for ImageNet, which achieves 82.8% top-1 accuracy.
