Abstract
Traditional knowledge distillation (KD) relies on a proficient teacher trained on the target task, which is not always available. In this setting, cross-task distillation can be used, enabling the use of any teacher model trained on a different task. However, many KD methods prove ineffective when applied to this cross-task setting. To address this limitation, we propose a simple modification: the use of an inverted projection. We show that this drop-in replacement for a standard projector is effective by learning to disregard any task-specific features which might degrade the student's performance. We find that this simple modification is sufficient for extending many KD methods to the cross-task setting, where the teacher and student tasks can be very different. In doing so, we obtain up to a 1.9% improvement in the cross-task setting compared to the traditional projection, at no additional cost. Our method can obtain significant performance improvements (up to 7%) when using even a randomly-initialised teacher on various tasks such as depth estimation, image translation, and semantic segmentation, despite the lack of any learned knowledge to transfer. To provide conceptual and analytical insights into this result, we show that using an inverted projection allows the distillation loss to be decomposed into a knowledge transfer and a spectral regularisation component. Through this analysis we are additionally able to propose a novel regularisation loss that allows teacher-free distillation, enabling performance improvements of up to 8.57% on ImageNet with no additional training costs.