Abstract
Pre-trained image-text models such as CLIP have demonstrated the strong power of vision-language representations learned from large-scale web-collected image-text data. In light of the well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to utilize an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still underexplored. In this paper, we investigate two questions: 1) what are the factors hindering post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have a great impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We will release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.