Abstract
This paper is concerned with self-supervised learning for small models. The problem is motivated by our empirical finding that, while the widely used contrastive self-supervised learning method has shown great progress in training large models, it does not work well for small models. To address this problem, we propose a new learning paradigm, named SElf-SupErvised Distillation (SEED), in which we leverage a larger network (as Teacher) to transfer its representational knowledge into a smaller architecture (as Student) in a self-supervised fashion. Instead of learning directly from unlabeled data, we train a student encoder to mimic the similarity score distribution inferred by a teacher over a set of instances. We show that SEED dramatically boosts the performance of small networks on downstream tasks. Compared with self-supervised baselines, SEED improves the top-1 accuracy from 42.2% to 67.6% on EfficientNet-B0 and from 36.3% to 68.2% on MobileNet-v3-Large on the ImageNet-1k dataset.
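The idea of matching a teacher's similarity score distribution can be illustrated with a minimal sketch. The abstract does not specify the exact formulation, so everything below is an assumption made for illustration: cosine similarity as the score, a temperature-scaled softmax over a small set of anchor instances, cross-entropy as the matching objective, and the hypothetical names `similarity_distribution`, `distillation_loss`, and `temperature`.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_distribution(embedding, anchors, temperature):
    """Softmax over cosine similarities between one embedding and each anchor.

    Assumption: the 'similarity score distribution' is modeled here as a
    temperature-scaled softmax; the abstract does not state this.
    """
    z = l2_normalize(embedding)
    logits = [
        sum(a * b for a, b in zip(z, l2_normalize(anchor))) / temperature
        for anchor in anchors
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_emb, teacher_emb, anchors, temperature=0.1):
    """Cross-entropy between the teacher's and the student's distributions.

    Assumption: cross-entropy is used as the 'mimic' objective. It is
    minimized when the student's distribution equals the teacher's.
    """
    p_teacher = similarity_distribution(teacher_emb, anchors, temperature)
    p_student = similarity_distribution(student_emb, anchors, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


# Toy usage: three anchor instances in a 2-D embedding space.
anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
teacher = [1.0, 0.2]
matched = distillation_loss(teacher, teacher, anchors)
mismatched = distillation_loss([0.0, 1.0], teacher, anchors)
print(matched < mismatched)
```

In this toy setup the loss is lowest when the student reproduces the teacher's distribution over the anchors, which is the behavior the distillation objective is meant to encourage. In practice the embeddings would come from the teacher and student encoders applied to the same image, and only the student would be updated.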