Abstract
In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models. The method introduces two core techniques: affinity mimicking and weight inheritance. Affinity mimicking explores the interaction between modalities during distillation, enabling student models to mimic the teachers' behavior of learning cross-modal feature alignment in a visual-linguistic affinity space. Weight inheritance transmits the pre-trained weights from the teacher models to their student counterparts to improve distillation efficiency. Moreover, we extend the method into a multi-stage progressive distillation to mitigate the loss of informative weights during extreme compression. Comprehensive experiments demonstrate the efficacy of TinyCLIP, showing that it can reduce the size of the pre-trained CLIP ViT-B/32 by 50% while maintaining comparable zero-shot performance. When targeting comparable performance, distillation with weight inheritance speeds up training by 1.4-7.8$\times$ compared to training from scratch. Moreover, our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves a zero-shot top-1 accuracy of 41.1% on ImageNet, surpassing the original CLIP ViT-B/16 by 3.5% while using only 8.9% of the parameters. Finally, we demonstrate the good transferability of TinyCLIP in various downstream tasks. Code and models will be open-sourced at https://aka.ms/tinyclip.
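
To make the idea of affinity mimicking concrete, the following is a minimal sketch and not the released implementation. The abstract does not state the loss, so the sketch assumes that the affinity is a softmax over temperature-scaled cosine similarities between the image and text embeddings of a batch, and that the student is trained with a cross-entropy between teacher and student affinities in both the image-to-text and text-to-image directions. The function names and the temperature value 0.07 are hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def affinity(queries, keys, temperature=0.07):
    """Assumed affinity: row-wise softmax over scaled cosine similarities.

    The result has shape (batch, batch) regardless of the embedding width,
    so teacher and student may use different embedding dimensions.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    return [
        softmax([sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k])
        for qi in q
    ]


def cross_entropy(target, pred, eps=1e-12):
    """Mean over rows of -sum(target * log(pred))."""
    rows = [
        -sum(p * math.log(s + eps) for p, s in zip(t_row, s_row))
        for t_row, s_row in zip(target, pred)
    ]
    return sum(rows) / len(rows)


def affinity_mimicking_loss(t_img, t_txt, s_img, s_txt, temperature=0.07):
    """Assumed loss: teacher affinities act as soft targets in both directions."""
    i2t = cross_entropy(
        affinity(t_img, t_txt, temperature), affinity(s_img, s_txt, temperature)
    )
    t2i = cross_entropy(
        affinity(t_txt, t_img, temperature), affinity(s_txt, s_img, temperature)
    )
    return i2t + t2i
```

Under these assumptions, the loss is smallest when the student reproduces the teacher's affinity matrices, and it grows when the student pairs images with the wrong captions. Because only the batch-by-batch affinities are compared, the student embeddings need not have the same width as the teacher embeddings.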