Clover: Towards A Unified Video-Language Alignment and Fusion Model

Abstract

Building a universal video-language model for solving various video understanding tasks (e.g., text-video retrieval, video question answering) is an open challenge to the machine learning field. Towards this goal, most recent attempts train the models, usually consisting of uni-modal and cross-modal feature encoders, with supervised or pair-wise contrastive pre-text tasks. Though offering attractive generality, the resulting models have to compromise between efficiency and performance. We argue that these flaws are caused by their pre-training strategies: they cannot align and fuse features from different modalities well simultaneously. We then introduce Clover, a Correlated Video-Language pre-training method, towards a universal video-language model for solving multiple video understanding tasks with neither performance nor efficiency compromise. It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task. Additionally, we propose to enhance the tri-modal alignment by incorporating learning from masked samples and a novel pair-wise ranking loss. Clover establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks for both zero-shot and fine-tuning settings, and eight video question answering tasks. Code and pre-trained models will be released at https://github.com/LeeYN-43/Clover.
