Self-Chained Image-Language Model for Video Localization and Question Answering

Abstract

Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware, temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on and rewind the moment to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and QA on videos. The SeViLA framework consists of two modules, Localizer and Answerer, both of which are parameter-efficiently fine-tuned from BLIP-2. We propose two ways of chaining these modules for cascaded inference and self-refinement. First, in the forward chain, the Localizer finds multiple language-aware keyframes in a video, which the Answerer uses to predict the answer. Second, in the reverse chain, the Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating the need for expensive video moment localization annotations. Our SeViLA framework outperforms several strong baselines on 5 challenging video QA and event prediction benchmarks, and achieves the state-of-the-art in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. We also analyze the impact of the Localizer, compare the Localizer with other temporal localization models, study pre-training and self-refinement of the Localizer, and vary the number of keyframes.
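
The two chains described in the abstract can be illustrated with a minimal toy sketch. This is not the SeViLA implementation: the function names, the word-overlap scoring, and the string "frames" are all hypothetical stand-ins for the BLIP-2-based modules. The pseudo-labeling rule used here (a frame counts as a keyframe if the Answerer gets the correct answer from that frame alone) is an assumption made for illustration; the abstract does not spell out the rule.

```python
from typing import Callable, List, Sequence, Tuple

Frame = str  # hypothetical stand-in for an image; a real system would use tensors


def forward_chain(
    frames: Sequence[Frame],
    question: str,
    localizer_score: Callable[[Frame, str], float],
    answerer: Callable[[List[Frame], str], str],
    k: int = 2,
) -> Tuple[List[int], str]:
    """Forward chain: the Localizer selects k keyframes, the Answerer uses them."""
    ranked = sorted(
        range(len(frames)),
        key=lambda i: localizer_score(frames[i], question),
        reverse=True,
    )
    keyframe_ids = sorted(ranked[:k])  # restore temporal order
    answer = answerer([frames[i] for i in keyframe_ids], question)
    return keyframe_ids, answer


def reverse_chain_pseudo_labels(
    frames: Sequence[Frame],
    question: str,
    gold_answer: str,
    answerer: Callable[[List[Frame], str], str],
) -> List[int]:
    """Reverse chain: the Answerer produces per-frame keyframe pseudo-labels.

    Assumed rule: label 1 if the Answerer is correct given only that frame.
    """
    return [int(answerer([f], question) == gold_answer) for f in frames]


# Toy stand-ins for the two modules (assumptions, for demonstration only).
def toy_localizer_score(frame: Frame, question: str) -> float:
    # Count words shared by the frame description and the question.
    return float(len(set(frame.split()) & set(question.lower().split())))


def toy_answerer(frames: List[Frame], question: str) -> str:
    return "yes" if any("dog" in f.split() for f in frames) else "no"


frames = ["empty street", "a dog runs", "dog catches ball", "sunset sky"]
question = "Does the dog catch the ball"

keyframe_ids, answer = forward_chain(
    frames, question, toy_localizer_score, toy_answerer, k=2
)
labels = reverse_chain_pseudo_labels(frames, question, "yes", toy_answerer)
```

In this sketch the pseudo-labels in `labels` would serve as training targets for refining the Localizer, so no human-annotated moment boundaries are needed; the refinement step itself is omitted.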
