Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

Abstract

Multi-channel video-language retrieval requires models to understand information from different channels (e.g., video+question, video+speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models such as CLIP have been shown to be highly effective at aligning entities in images/videos and text, and text contrastive models such as SimCSE have been extensively studied recently for their strong ability to produce discriminative sentence embeddings. These abilities are exactly what multi-channel video-language retrieval needs. However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on a categorization of recent methods, we investigate representing videos using either continuous feature vectors or discrete text tokens; for the fusion method, we explore using either a multimodal transformer or a pretrained contrastive text model. We extensively evaluate the four combinations on five video-language datasets. Surprisingly, we find that discrete text tokens coupled with a pretrained contrastive text model yield the best performance, which can even outperform the state of the art on the iVQA and How2QA datasets without additional training on millions of video-language examples. Further analysis shows that this is because representing videos as text tokens captures the key visual information in a form that is naturally aligned with text models, and because text models are strong multimodal retrievers after the contrastive pretraining process.
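As a rough illustration of the best-performing combination described above (discrete text tokens fused by a text model), the following minimal sketch ranks candidate answers against a query built by concatenating video tokens with the question. Everything in it is an assumption made for illustration: the bag-of-words `embed` function is only a toy stand-in for a pretrained contrastive text encoder, and the video tokens, question, and candidates are hypothetical. It is not the paper's implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a contrastive text encoder (assumption):
    an L2-normalised bag-of-words vector stored as a dict."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a, b):
    """Dot product of two already-normalised sparse vectors."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())


def retrieve(video_tokens, other_channel_text, candidates):
    """Fuse the two channels by plain text concatenation, then rank
    the candidate responses by similarity to the fused query."""
    query = embed(" ".join(video_tokens) + " " + other_channel_text)
    scores = [cosine(query, embed(c)) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores


# Hypothetical inputs: tokens that a vision-text model might assign to a video.
video_tokens = ["knife", "onion", "cutting", "board"]
question = "what is the person chopping"
candidates = ["a carrot", "an onion", "a loaf of bread"]

best, scores = retrieve(video_tokens, question, candidates)
print(candidates[best])
```

In this toy setting the question alone shares no words with any candidate, so the match comes entirely from the video tokens. A real system would replace `embed` with a learned sentence encoder rather than exact word overlap.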
