Abstract
Feature regression is a simple way to distill large neural network models to smaller ones. We show that with simple changes to the network architecture, regression can outperform more complex state-of-the-art approaches for knowledge distillation from self-supervised models. Surprisingly, the addition of a multi-layer perceptron head to the CNN backbone is beneficial even if used only during distillation and discarded in the downstream task. Deeper non-linear projections can thus be used to accurately mimic the teacher without changing inference architecture and time. Moreover, we utilize independent projection heads to simultaneously distill multiple teacher networks. We also find that using the same weakly augmented image as input for both teacher and student networks aids distillation. Experiments on the ImageNet dataset demonstrate the efficacy of the proposed changes in various self-supervised distillation settings.
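As a rough illustration of the setup described above, the following NumPy sketch regresses student features onto teacher features through one independent MLP head per teacher. It is a minimal sketch under stated assumptions, not the authors' implementation: the loss form (squared error between L2-normalized features), the layer sizes, the teacher names, and the helper functions are all hypothetical, and random vectors stand in for the CNN backbone outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(dims):
    """Create (weight, bias) pairs for an MLP with the given layer sizes (hypothetical helper)."""
    return [(rng.normal(0.0, dims[i] ** -0.5, (dims[i], dims[i + 1])), np.zeros(dims[i + 1]))
            for i in range(len(dims) - 1)]

def mlp(params, x):
    """Linear layers with ReLU in between; no activation after the last layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def regression_loss(pred, target):
    """Mean squared error between L2-normalized features (an assumed loss form)."""
    diff = l2_normalize(pred) - l2_normalize(target)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Stand-ins: random vectors play the role of backbone outputs computed on the
# SAME weakly augmented batch for the student and for two frozen teachers.
batch, student_dim = 8, 32
teacher_dims = {"teacher_a": 64, "teacher_b": 48}  # hypothetical teachers
student_feats = rng.normal(size=(batch, student_dim))
teacher_feats = {name: rng.normal(size=(batch, d)) for name, d in teacher_dims.items()}

# One independent MLP head per teacher, used only during distillation.
heads = {name: init_mlp([student_dim, 128, d]) for name, d in teacher_dims.items()}

losses = {name: regression_loss(mlp(heads[name], student_feats), teacher_feats[name])
          for name in heads}
total_loss = sum(losses.values())

# Downstream, the heads are dropped and only the backbone features are used,
# so the inference architecture and cost are unchanged.
downstream_feats = student_feats
```

In a real training loop, `total_loss` would be minimized with respect to the student backbone and the heads while the teachers stay fixed; the sketch only shows the forward computation.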