Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment

Abstract

Text-to-video retrieval systems have recently made significant progress by utilizing pre-trained models trained on large-scale image-text pairs. However, most of the latest methods primarily focus on the video modality while disregarding the audio signal for this task. A recent advancement by ECLIPSE has improved long-range text-to-video retrieval by developing an audiovisual video representation. Nonetheless, the objective of the text-to-video retrieval task is to capture the complementary audio and video information that is pertinent to the text query rather than simply achieving better audio and video alignment. To address this issue, we introduce TEFAL, a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query. Instead of using only an audiovisual attention block, which could suppress the audio information relevant to the text query, our approach employs two independent cross-modal attention blocks that enable the text to attend to the audio and video representations separately. We demonstrate the efficacy of the proposed method on four benchmark datasets that include audio: MSR-VTT, LSMDC, VATEX, and Charades, where it consistently achieves better than state-of-the-art performance. We attribute this to the additional text-query-conditioned audio representation and the complementary information it adds to the text-query-conditioned video representation.
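
The idea of letting the text query attend to the audio and video features through two separate attention blocks can be illustrated with a minimal sketch. This is not the TEFAL implementation: the function names, the single-head dot-product attention without learned projections, and the additive fusion of the two pooled embeddings are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def text_conditioned_pool(text, frames):
    """Pool per-frame features with attention weights derived from the text query.

    Hypothetical helper: the text vector acts as the query, and each frame
    feature acts as both key and value (no learned projections).
    """
    scale = math.sqrt(len(text))
    weights = softmax([dot(text, f) / scale for f in frames])
    dim = len(frames[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)]
    return pooled, weights

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def retrieval_score(text, video_frames, audio_frames):
    # Two independent attention passes: the text attends to video and to
    # audio separately, so neither modality can suppress the other.
    video_emb, _ = text_conditioned_pool(text, video_frames)
    audio_emb, _ = text_conditioned_pool(text, audio_frames)
    # Assumption: additive fusion; the abstract does not state how the two
    # text-conditioned representations are combined.
    fused = [v + a for v, a in zip(video_emb, audio_emb)]
    return cosine(text, fused)

if __name__ == "__main__":
    text = [1.0, 0.0]
    video = [[1.0, 0.0], [0.0, 1.0]]
    audio = [[0.5, 0.5], [0.9, 0.1]]
    print(retrieval_score(text, video, audio))
```

In this toy setup, frames whose features align with the text vector receive larger attention weights, so each pooled embedding emphasizes the query-relevant part of its own modality before the two are combined.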
