QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension

Abstract

Recent advances in long video understanding typically mitigate visual redundancy through visual token pruning based on attention distribution. However, while existing methods employ post-hoc low-response token pruning in decoder layers, they overlook the input-level semantic correlation between visual tokens and instructions (the query). In this paper, we propose QuoTA, an ante-hoc, training-free module that extends existing large video-language models (LVLMs) with visual token assignment based on query-oriented frame-level importance assessment. Query-oriented token selection is crucial because it aligns visual processing with task-specific requirements, optimizing token budget utilization while preserving semantically relevant content. Specifically, (i) QuoTA strategically allocates frame-level importance scores based on query relevance, enabling one-time visual token assignment before cross-modal interactions in decoder layers, (ii) we decouple the query through Chain-of-Thoughts reasoning to facilitate more precise LVLM-based frame importance scoring, and (iii) QuoTA offers plug-and-play functionality that extends to existing LVLMs. Extensive experimental results demonstrate that implementing QuoTA with LLaVA-Video-7B yields an average performance improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while operating within the same visual token budget as the baseline. Code is open-sourced at https://github.com/MAC-AutoML/QuoTA.
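
To make the idea of one-time, budget-constrained token assignment concrete, the following is a minimal sketch that splits a fixed visual token budget across frames in proportion to per-frame importance scores. It is not the authors' implementation: the abstract does not specify the allocation rule, so proportional allocation with largest-remainder rounding is an assumption, and the names `allocate_tokens`, `scores`, `budget`, and `max_per_frame` are hypothetical. In QuoTA the scores would come from LVLM-based, query-oriented frame scoring; here they are simply given as numbers.

```python
from typing import List


def allocate_tokens(scores: List[float], budget: int, max_per_frame: int) -> List[int]:
    """Split `budget` tokens across frames in proportion to `scores`.

    Assumed rule (not taken from the paper): proportional shares, rounded
    down, with leftover tokens handed out by largest fractional remainder.
    No frame receives more than `max_per_frame` tokens.
    """
    if budget < 0 or max_per_frame < 0:
        raise ValueError("budget and max_per_frame must be non-negative")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    n = len(scores)
    if n == 0:
        return []

    # The budget can never exceed what all frames together can hold.
    budget = min(budget, n * max_per_frame)

    total = sum(scores)
    if total > 0:
        ideal = [s * budget / total for s in scores]
    else:
        # All scores are zero: fall back to a uniform split.
        ideal = [budget / n] * n

    # Round down and apply the per-frame cap.
    alloc = [min(int(x), max_per_frame) for x in ideal]

    # Hand out the remaining tokens, largest unmet share first.
    leftover = budget - sum(alloc)
    order = sorted(range(n), key=lambda i: ideal[i] - alloc[i], reverse=True)
    while leftover > 0:
        for i in order:
            if leftover == 0:
                break
            if alloc[i] < max_per_frame:
                alloc[i] += 1
                leftover -= 1
    return alloc


if __name__ == "__main__":
    # Three frames; the second is judged most relevant to the query.
    print(allocate_tokens([1, 6, 3], budget=100, max_per_frame=196))
```

Under this assumed rule the total number of assigned tokens always equals the (possibly capped) budget, which mirrors the abstract's claim of operating within the same visual token budget as the baseline while shifting tokens toward query-relevant frames.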
