Expanding Language-Image Pretrained Models for General Video Recognition

Abstract

Contrastive language-image pretraining has shown great success in learning visual-textual joint representations from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. This module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12 times fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited. Code and models are available at https://aka.ms/X-CLIP
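
To make the idea of exchanging information across frames concrete, the following is a minimal sketch of scaled dot-product self-attention applied along the temporal dimension. It is an illustration under stated assumptions, not the implementation released with the paper: it assumes each frame has already been summarized as a single feature vector, uses a single attention head, and omits the learned query, key, and value projections. The function name `cross_frame_attention` and the helper `softmax` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(frame_tokens):
    """Mix per-frame feature vectors with self-attention over frames.

    frame_tokens: list of T vectors (one per frame), each of length d.
    Assumption: queries, keys, and values are the raw frame vectors;
    a real model would apply learned linear projections first.
    Returns a list of T vectors of length d.
    """
    d = len(frame_tokens[0])
    scale = 1.0 / math.sqrt(d)
    mixed_tokens = []
    for query in frame_tokens:
        # Similarity of this frame to every frame (itself included).
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in frame_tokens
        ]
        weights = softmax(scores)
        # Weighted average of all frame vectors.
        mixed = [
            sum(w * value[j] for w, value in zip(weights, frame_tokens))
            for j in range(d)
        ]
        mixed_tokens.append(mixed)
    return mixed_tokens


if __name__ == "__main__":
    # Three toy "frames" with 2-dimensional features.
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for token in cross_frame_attention(frames):
        print(token)
```

Each output vector is a convex combination of all frame vectors, so every frame's representation now carries information from the others. The cost grows with the square of the number of frames rather than the number of patches, which is one plausible reading of why such a module can stay lightweight.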
