Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval

Abstract

Video retrieval requires aligning visual content with corresponding natural language descriptions. In this paper, we introduce Modality Auxiliary Concepts for Video Retrieval (MAC-VR), a novel approach that leverages modality-specific tags, automatically extracted from foundation models, to enhance video retrieval. We propose to align modalities in a latent space, along with learning and aligning auxiliary latent concepts derived from the features of a video and its corresponding caption. We introduce these auxiliary concepts to improve the alignment of visual and textual latent concepts, making it possible to distinguish concepts from one another. We conduct extensive experiments on five diverse datasets: MSR-VTT, DiDeMo, TGIF, Charades and YouCook2. The experimental results consistently demonstrate that modality-specific tags improve cross-modal alignment, outperforming current state-of-the-art methods across three datasets and performing comparably or better across the other two.
