CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

Abstract

Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space, yielding tremendous potential for vision-language (VL) tasks. While early concurrent works have begun to study this potential on a subset of tasks, important questions remain: 1) What is the benefit of CLIP on unstudied VL tasks? 2) Does CLIP provide benefit in low-shot or domain-shifted scenarios? 3) Can CLIP improve existing approaches without impacting inference or pretraining complexity? In this work, we seek to answer these questions through two key contributions. First, we introduce an evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of data availability constraints and conditions of domain shift. Second, we propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures using a dynamically weighted objective applied to adaptively selected tokens per instance. Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only. On SNLI-VE, CLIP-TD produces significant gains in low-shot conditions (up to 6.6%) as well as fully-supervised conditions (up to 3%). On VQA, CLIP-TD provides improvement in low-shot (up to 9%) and fully-supervised (up to 1.3%) conditions. Finally, CLIP-TD outperforms concurrent works utilizing CLIP for finetuning, as well as baseline naive distillation approaches. Code will be made available.
