VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

Abstract

Recent studies have revealed that selecting informative and relevant video frames can significantly improve the performance of Video Large Language Models (Video-LLMs). Current methods, such as reducing inter-frame redundancy, employing separate models for image-text relevance assessment, or utilizing temporal video grounding for event localization, largely adopt unsupervised learning paradigms and struggle to address the complex scenarios in long video understanding. We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions. The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process. First, it generates detailed clip-level captions conditioned on the instruction; then, it retrieves relevant video segments through instruction-guided reasoning; finally, it performs fine-grained frame selection to pinpoint the most informative visual evidence. Leveraging VidThinker, we construct the VideoITG-40K dataset, containing 40K videos and 500K instructed temporal grounding annotations. We then design a plug-and-play VideoITG model, which takes advantage of the visual-language alignment and reasoning capabilities of Video-LLMs, for effective frame selection in a discriminative manner. Coupled with Video-LLMs, VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks, showing its superiority and great potential for video understanding.
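
The three stages described for VidThinker (instruction-conditioned clip captioning, segment retrieval, fine-grained frame selection) can be illustrated with a toy sketch. This is not the authors' implementation: the names `Clip`, `split_into_clips`, `keyword_overlap`, and `select_frames` are hypothetical, the captioner is a caller-supplied stand-in for a Video-LLM, keyword overlap replaces the instruction-guided reasoning step, and evenly spaced sampling replaces the learned frame selection.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    start: int          # index of the first frame (inclusive)
    end: int            # index one past the last frame (exclusive)
    caption: str = ""   # filled in by the captioning stage


def split_into_clips(num_frames: int, clip_len: int) -> List[Clip]:
    """Cut a video of num_frames frames into consecutive fixed-length clips."""
    return [Clip(s, min(s + clip_len, num_frames))
            for s in range(0, num_frames, clip_len)]


def keyword_overlap(instruction: str, caption: str) -> float:
    """Toy relevance score: fraction of instruction words found in the caption.

    Assumption: a stand-in for instruction-guided reasoning, which the
    abstract attributes to a model rather than to keyword matching.
    """
    words = set(instruction.lower().split())
    if not words:
        return 0.0
    return len(words & set(caption.lower().split())) / len(words)


def select_frames(
    num_frames: int,
    clip_len: int,
    instruction: str,
    captioner: Callable[[Clip, str], str],
    threshold: float = 0.5,
    frames_per_clip: int = 2,
) -> List[int]:
    """Return frame indices chosen by a caption -> retrieve -> select flow."""
    clips = split_into_clips(num_frames, clip_len)

    # Stage 1: caption every clip, conditioned on the instruction.
    for clip in clips:
        clip.caption = captioner(clip, instruction)

    # Stage 2: keep only clips whose caption is relevant to the instruction.
    relevant = [c for c in clips
                if keyword_overlap(instruction, c.caption) >= threshold]

    # Stage 3: pick evenly spaced frames inside each relevant clip.
    selected: List[int] = []
    for c in relevant:
        span = c.end - c.start
        k = min(frames_per_clip, span)
        if k <= 0:
            continue
        step = span / k
        selected.extend(c.start + int(i * step) for i in range(k))
    return selected


# Usage with a fake captioner that looks captions up by clip index.
fake_captions = {0: "a man opens the door",
                 1: "a dog runs",
                 2: "the man closes the door"}
frames = select_frames(
    num_frames=30,
    clip_len=10,
    instruction="man door",
    captioner=lambda clip, instr: fake_captions[clip.start // 10],
)
print(frames)  # [0, 5, 20, 25]
```

In this sketch the middle clip is dropped because its caption shares no words with the instruction, so only frames from the first and third clips are passed on to a downstream Video-LLM.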
