Distilling Vision-Language Models on Millions of Videos

  • 2024-04-15 22:10:37
  • Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan

Abstract

The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a side product, we generate the largest video caption dataset to date.

 
