ViLA: Efficient Video-Language Alignment for Video Question Answering

Abstract

In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA model addresses both efficient frame sampling and effective cross-modal alignment in a unified way. In our ViLA network, we design a new learnable text-guided Frame-Prompter together with a new cross-modal distillation (QFormer-Distiller) module. Pre-trained large image-language models have shown promising results on problems such as visual question answering (VQA). However, how to efficiently and effectively sample video frames when adapting a pre-trained large image-language model to video-language alignment remains a major challenge. Compared with prior work, our ViLA model demonstrates the capability of selecting key frames with critical content, thus improving video-language alignment accuracy while reducing inference latency (+3.3% on NExT-QA Temporal with a 3.0X speed-up). Overall, our ViLA network outperforms state-of-the-art methods on video question-answering benchmarks: +4.6% on STAR Interaction and +2.2% on STAR average with a 3.0X speed-up, and our 2-frame model outperforms the 4-frame SeViLA on the VLEP dataset with a 4.2X speed-up.
