Subclass Distillation

Abstract

After a large "teacher" neural network has been trained on labeled data, the probabilities that the teacher assigns to incorrect classes reveal a lot of information about the way in which the teacher generalizes. By training a small "student" model to match these probabilities, it is possible to transfer most of the generalization ability of the teacher to the student, often producing a much better small model than directly training the student on the training data. The transfer works best when there are many possible classes because more is then revealed about the function learned by the teacher, but in cases where there are only a few possible classes we show that we can improve the transfer by forcing the teacher to divide each class into many subclasses that it invents during the supervised training. The student is then trained to match the subclass probabilities. For datasets where there are known, natural subclasses we demonstrate that the teacher learns similar subclasses and these improve distillation. For clickthrough datasets where the subclasses are unknown we demonstrate that subclass distillation allows the student to learn faster and better.
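The idea of matching subclass probabilities can be illustrated with a minimal NumPy sketch. This is one plausible reading of the abstract, not the authors' implementation: the function names, the contiguous layout of subclass logits, the summing of subclass probabilities to obtain class probabilities, and the temperature parameter are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def class_probs_from_subclasses(subclass_logits, num_classes, subclasses_per_class):
    """Collapse subclass probabilities into class probabilities.

    Assumption: the last axis holds num_classes * subclasses_per_class logits,
    with the subclasses of each class stored contiguously.
    """
    probs = np.exp(log_softmax(subclass_logits))
    grouped = probs.reshape(-1, num_classes, subclasses_per_class)
    return grouped.sum(axis=-1)


def subclass_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between teacher and student subclass distributions.

    Assumption: both networks emit logits over the same set of subclasses,
    and both are softened by the same (hypothetical) temperature.
    """
    teacher_probs = np.exp(log_softmax(teacher_logits / temperature))
    student_log_probs = log_softmax(student_logits / temperature)
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())


# Toy example: 2 classes, 3 invented subclasses each, batch of 4 examples.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 6))
student_logits = rng.normal(size=(4, 6))

class_probs = class_probs_from_subclasses(
    teacher_logits, num_classes=2, subclasses_per_class=3
)
loss = subclass_distillation_loss(student_logits, teacher_logits, temperature=2.0)
```

Under these assumptions, the teacher's supervised loss would be computed on `class_probs`, while the student is trained on the full subclass distribution, which carries more information per example than the two class probabilities alone.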
