Abstract
Knowledge distillation is effective for training small, generalisable network models that meet low-memory and fast-execution requirements. Existing offline distillation methods rely on a strong pre-trained teacher, which enables favourable knowledge discovery and transfer but requires a complex two-phase training procedure. Online counterparts address this limitation at the price of lacking a high-capacity teacher. In this work, we present an On-the-fly Native Ensemble (ONE) strategy for one-stage online distillation. Specifically, ONE trains only a single multi-branch network while simultaneously establishing a strong teacher on the fly to enhance the learning of the target network. Extensive evaluations show that ONE improves the generalisation performance of a variety of deep neural networks more significantly than alternative methods on four image classification datasets: CIFAR10, CIFAR100, SVHN, and ImageNet, whilst also offering computational efficiency advantages.
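
As a rough illustration of the one-stage idea, the sketch below computes a per-sample training loss for a multi-branch network in which the branches themselves form the teacher. It is a minimal sketch under stated assumptions, not the method's actual formulation: the teacher logits are taken to be a plain average of the branch logits, the distillation term is a temperature-softened KL divergence scaled by T^2, and all terms are summed with equal weight. The function names (`softmax`, `cross_entropy`, `kl_divergence`, `one_style_loss`) and the default temperature are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-likelihood of the true class."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def one_style_loss(branch_logits, label, temperature=3.0):
    """Loss for one sample given a list of per-branch logit lists.

    Assumption: the on-the-fly teacher is the unweighted mean of the
    branch logits; a learned weighting could be used instead.
    """
    n = len(branch_logits)
    num_classes = len(branch_logits[0])
    teacher_logits = [
        sum(b[c] for b in branch_logits) / n for c in range(num_classes)
    ]
    teacher_soft = softmax(teacher_logits, temperature)

    # Supervised term for the ensemble teacher itself.
    loss = cross_entropy(teacher_logits, label)
    for b in branch_logits:
        # Supervised term for each branch.
        loss += cross_entropy(b, label)
        # Distillation term: each branch matches the softened teacher.
        student_soft = softmax(b, temperature)
        loss += temperature ** 2 * kl_divergence(teacher_soft, student_soft)
    return loss, teacher_logits
```

In this sketch, a call such as `one_style_loss([[2.0, 0.0, -1.0], [1.0, 1.0, 0.0]], label=0)` returns the scalar loss together with the averaged teacher logits. In an automatic-differentiation framework one would typically treat the teacher distribution as a constant within the distillation term, which is likewise an assumption here.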