LGDN: Language-Guided Denoising Network for Video-Language Modeling

Abstract

Video-language modeling has attracted much attention with the rapid growth of web videos. Most existing methods assume that the video frames and the text description are semantically correlated, and focus on video-language modeling at the video level. However, this assumption often fails for two reasons: (1) given the rich semantics of video content, it is difficult to cover all frames with a single video-level description; (2) a raw video typically contains noisy or meaningless information (e.g., scenery shots, transitions, or teasers). Although a number of recent works deploy attention mechanisms to alleviate this problem, the irrelevant or noisy information still makes it very difficult to address. To overcome this challenge, we propose an efficient and effective model, termed Language-Guided Denoising Network (LGDN), for video-language modeling. Unlike most existing methods, which utilize all extracted video frames, LGDN dynamically filters out misaligned or redundant frames under language supervision and retains only 2-4 salient frames per video for cross-modal token-level alignment. Extensive experiments on five public datasets show that LGDN outperforms state-of-the-art methods by large margins. We also provide a detailed ablation study to reveal the critical importance of solving the noise issue, in the hope of inspiring future video-language work.
