HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

Abstract

Despite advancements in multimodal large language models (MLLMs), current approaches struggle with medium-to-long video understanding due to frame and context length limitations. As a result, these models often depend on frame sampling, which risks missing key information over time and lacks task-specific relevance. To address these challenges, we introduce HierarQ, a task-aware hierarchical Q-Former-based framework that processes frames sequentially to bypass the need for frame sampling while avoiding the LLM's context length limitations. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding, with the entity stream capturing frame-level object information within a short context and the scene stream identifying their broader interactions over longer periods of time. Each stream is supported by dedicated memory banks, which enable our proposed Hierarchical Querying Transformer (HierarQ) to effectively capture short- and long-term context. Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance across most datasets, proving its robustness and efficiency for comprehensive video analysis.
