Learning Video Context as Interleaved Multimodal Sequences

Abstract

Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identifying who, relationships, and reasons). In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as interleaved multimodal sequences (including images, plots, videos, and subtitles), either by linking external knowledge databases or by using offline models (such as Whisper for subtitles). Through instruction-tuning, this approach empowers the language model to interact with videos using interleaved multimodal instructions. For example, instead of relying solely on video as input, we jointly provide character photos alongside their names and dialogues, allowing the model to associate these elements and generate more comprehensive responses. To demonstrate its effectiveness, we validate MovieSeq's performance on six datasets (LVU, MAD, Movienet, CMD, TVC, MovieQA) across five settings (video classification, audio description, video-text retrieval, video captioning, and video question-answering). The code will be made public at https://github.com/showlab/MovieSeq.
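To make the idea of an interleaved multimodal sequence concrete, the following is a minimal illustrative sketch, not MovieSeq's actual input format. All names here (`Segment`, `build_interleaved_sequence`, `render`, and the `<image>`/`<video>` placeholders) are hypothetical and chosen only for illustration; the abstract does not specify how the sequence is encoded.

```python
from dataclasses import dataclass
from typing import List, Literal, Tuple


@dataclass
class Segment:
    """One element of an interleaved sequence (hypothetical structure)."""
    kind: Literal["text", "image", "video"]
    content: str  # literal text, or a reference (e.g. a file path) to visual data


def build_interleaved_sequence(
    characters: List[Tuple[str, str]],  # (name, photo reference)
    subtitles: List[Tuple[str, str]],   # (speaker, dialogue line)
    video_ref: str,
    question: str,
) -> List[Segment]:
    """Interleave character photos, names, the video, and dialogues."""
    seq: List[Segment] = []
    # Pair each character photo with its name so the model can associate them.
    for name, photo in characters:
        seq.append(Segment("image", photo))
        seq.append(Segment("text", f"This is {name}."))
    # The video clip itself.
    seq.append(Segment("video", video_ref))
    # Subtitles, e.g. as transcribed offline by a speech recognition model.
    for speaker, line in subtitles:
        seq.append(Segment("text", f"{speaker}: {line}"))
    # The instruction or question comes last.
    seq.append(Segment("text", question))
    return seq


def render(seq: List[Segment]) -> str:
    """Flatten to a prompt string, with placeholders standing in for visual inputs."""
    return " ".join(s.content if s.kind == "text" else f"<{s.kind}>" for s in seq)
```

In a real model the `<image>` and `<video>` placeholders would be replaced by visual embeddings; here they only mark where each modality sits in the sequence.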
