Controllable Hybrid Captioner for Improved Long-form Video Understanding

  • 2025-08-25 16:17:48
  • Kuleen Sasse, Efsun Sarioglu Kayi, Arun Reddy

Abstract

Video data, especially long-form video, is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content in a much more compact manner than raw video. In addition, textual representations are easily ingested by state-of-the-art large language models (LLMs), which enable reasoning over video content to answer complex natural language queries. To cope with this density, we rely on the progressive construction of a text-based memory by a video captioner operating on shorter chunks of the video, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of an activity log composed solely of short video captions. Because the video captions tend to focus on human actions, and questions may pertain to other information in the scene, we seek to enrich the memory with static scene descriptions using Vision Language Models (VLMs). Our video understanding system relies on the LaViLa video captioner in combination with an LLM to answer questions about videos. We first explored different ways of partitioning the video into meaningful segments such that the textual descriptions more accurately reflect the structure of the video content. Furthermore, we incorporated static scene descriptions into the captioning pipeline using the LLaVA VLM, resulting in a more detailed and complete caption log and expanding the space of questions that are answerable from the textual memory. Finally, we fine-tuned the LaViLa video captioner to produce both action and scene captions, significantly improving the efficiency of the captioning pipeline compared to using separate captioning models for the two tasks. Our model, the controllable hybrid captioner, can alternate between different types of captions according to special input tokens that signal scene changes detected in the video.
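The control-token idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. The token strings `<scene>` and `<action>`, the feature-difference threshold used to flag a scene change, and the `captioner` callable are all hypothetical stand-ins, since the abstract names neither the actual tokens nor the scene-change detector.

```python
from typing import Callable, List, Sequence

# Hypothetical control tokens; the abstract does not name the real ones.
ACTION_TOKEN = "<action>"
SCENE_TOKEN = "<scene>"


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def choose_control_tokens(
    chunk_features: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[str]:
    """Return one control token per video chunk.

    Assumption: the first chunk, and any chunk whose features differ from
    the previous chunk's by more than `threshold`, counts as a scene change
    and gets SCENE_TOKEN. Every other chunk gets ACTION_TOKEN.
    """
    tokens: List[str] = []
    prev = None
    for feat in chunk_features:
        changed = prev is None or mean_abs_diff(prev, feat) > threshold
        tokens.append(SCENE_TOKEN if changed else ACTION_TOKEN)
        prev = feat
    return tokens


def build_caption_log(
    chunk_features: Sequence[Sequence[float]],
    captioner: Callable[[str, Sequence[float]], str],
    threshold: float = 0.5,
) -> List[str]:
    """Caption each chunk with a single captioner, steered by its token."""
    tokens = choose_control_tokens(chunk_features, threshold)
    return [
        f"[chunk {i}] {captioner(tok, feat)}"
        for i, (tok, feat) in enumerate(zip(tokens, chunk_features))
    ]


# Toy usage with a stub captioner that only echoes the control token.
features = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]]
log = build_caption_log(features, lambda tok, feat: f"{tok} caption")
```

In this sketch one captioner serves both caption types, which mirrors the efficiency argument in the abstract: the control token, not a second model, decides whether a chunk receives a scene description or an action caption. The resulting log is the text memory that an LLM would then read to answer questions.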

 
