QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension

  • 2025-03-11 18:59:57
  • Yongdong Luo, Wang Chen, Xiawu Zheng, Weizhong Huang, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Jiebo Luo, Rongrong Ji

Abstract

Recent advances in long video understanding typically mitigate visual redundancy through visual token pruning based on attention distribution. However, while existing methods employ post-hoc low-response token pruning in decoder layers, they overlook the input-level semantic correlation between visual tokens and instructions (the query). In this paper, we propose QuoTA, an ante-hoc, training-free module that extends existing large video-language models (LVLMs) with visual token assignment based on query-oriented frame-level importance assessment. Query-oriented token selection is crucial because it aligns visual processing with task-specific requirements, optimizing token budget utilization while preserving semantically relevant content. Specifically, (i) QuoTA strategically allocates frame-level importance scores based on query relevance, enabling one-time visual token assignment before cross-modal interactions in decoder layers; (ii) we decouple the query through Chain-of-Thoughts reasoning to facilitate more precise LVLM-based frame importance scoring; and (iii) QuoTA offers plug-and-play functionality that extends to existing LVLMs. Extensive experimental results demonstrate that implementing QuoTA with LLaVA-Video-7B yields an average performance improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while operating within the same visual token budget as the baseline. Code is open-sourced at https://github.com/MAC-AutoML/QuoTA.
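To make the idea of one-time, budget-preserving token assignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes that per-frame importance scores are already available (for example, produced by an LVLM scoring each frame against the decoupled query) and that tokens are split in proportion to those scores. The proportional rule, the largest-remainder rounding, the uniform fallback, and the name `allocate_tokens` are all illustrative assumptions; the abstract only states that tokens are assigned from frame-level importance within a fixed budget.

```python
import math


def allocate_tokens(scores, budget):
    """Split a fixed visual token budget across frames by importance.

    Hypothetical helper: `scores` holds one non-negative importance value
    per frame, `budget` is the total number of visual tokens allowed.
    Returns a list of per-frame token counts that sums exactly to `budget`.
    """
    if budget < 0 or any(s < 0 for s in scores):
        raise ValueError("budget and scores must be non-negative")
    if not scores:
        return []

    total = sum(scores)
    if total == 0:
        # Assumption: if no frame is judged relevant, fall back to a uniform split.
        exact = [budget / len(scores)] * len(scores)
    else:
        # Ideal (fractional) share of the budget for each frame.
        exact = [s * budget / total for s in scores]

    # Largest-remainder rounding: floor every share, then hand the leftover
    # tokens to the frames with the largest fractional parts.
    alloc = [math.floor(x) for x in exact]
    leftover = budget - sum(alloc)
    by_remainder = sorted(
        range(len(scores)), key=lambda i: exact[i] - alloc[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc
```

Under this sketch, a frame scored six times higher than another receives roughly six times as many tokens, while the total stays equal to the baseline budget. A real system would additionally need to decide how each frame's tokens are reduced to its assigned count (for instance by pooling or resizing), which is outside what this example covers.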
