Frozen CLIP Models are Efficient Video Learners

Abstract

Video recognition has been dominated by the end-to-end learning paradigm: first initializing a video recognition model with weights of a pretrained image model and then conducting end-to-end training on videos. This enables the video network to benefit from the pretrained image model. However, this requires substantial computation and memory resources for finetuning on videos, and the alternative of directly using pretrained image features without finetuning the image backbone leads to subpar results. Fortunately, recent advances in Contrastive Vision-Language Pre-training (CLIP) pave the way for a new route for visual recognition tasks. Pretrained on large open-vocabulary image-text pair data, these models learn powerful visual representations with rich semantics. In this paper, we present Efficient Video Learning (EVL), an efficient framework for directly training high-quality video recognition models with frozen CLIP features. Specifically, we employ a lightweight Transformer decoder and learn a query token to dynamically collect frame-level spatial features from the CLIP image encoder. Furthermore, we adopt a local temporal module in each decoder layer to discover temporal clues from adjacent frames and their attention maps. We show that despite being efficient to train with a frozen backbone, our models learn high-quality video representations on a variety of video recognition datasets. Code is available at https://github.com/OpenGVLab/efficient-video-recognition.
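
The idea of a learned query token collecting frame-level spatial features from a frozen encoder can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not the EVL implementation: it uses a single attention head with no learned key/value projections, random numbers in place of CLIP features, and omits the multi-layer decoder and the local temporal module. All names (`attend`, `pool_video`, `frozen_features`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, tokens):
    """Single-head cross-attention: one query vector pools a list of token vectors."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, t) * scale for t in tokens])
    dim = len(query)
    pooled = [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]
    return pooled, weights


def pool_video(query, frame_features):
    """frame_features has shape [frames][tokens][dim]; it is treated as fixed input."""
    tokens = [tok for frame in frame_features for tok in frame]
    return attend(query, tokens)


random.seed(0)
T, N, D = 4, 6, 8  # frames, spatial tokens per frame, feature width (toy sizes)

# Stand-in for frozen image-encoder outputs; in practice these would come from CLIP.
frozen_features = [
    [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)] for _ in range(T)
]

# In this sketch the query is the only quantity that would be trained.
query = [random.gauss(0, 1) for _ in range(D)]

video_vec, attn = pool_video(query, frozen_features)
print(len(video_vec), len(attn), round(sum(attn), 6))
```

Because the frame features are treated as fixed inputs, only the query (and, in a real model, the decoder weights) would receive gradients, which is where the training-cost savings described above would come from.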
