Multimodal Long Video Modeling Based on Temporal Dynamic Context

Abstract

Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amount of information within the video. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modalities such as audio. In this work, we propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC). First, we segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders. Second, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.
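
To make the first step concrete, the following is a minimal sketch of scene segmentation driven by inter-frame similarity. It is an illustration only, not the released TDC implementation: the abstract does not specify the similarity measure or the splitting rule, so the use of cosine similarity between adjacent frame features, the fixed `threshold`, and the function names `cosine_similarity` and `segment_scenes` are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature as dissimilar to everything
    return dot / (norm_a * norm_b)


def segment_scenes(frame_features, threshold=0.8):
    """Split frames into contiguous scenes.

    Assumption (not from the paper): a new scene starts whenever the
    similarity between two adjacent frames drops below `threshold`.
    Returns a list of (start, end) frame index pairs, both inclusive.
    """
    if not frame_features:
        return []
    segments = []
    start = 0
    for i in range(1, len(frame_features)):
        sim = cosine_similarity(frame_features[i - 1], frame_features[i])
        if sim < threshold:
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(frame_features) - 1))
    return segments


# Toy per-frame features: two frames of one scene, then three of another.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
scenes = segment_scenes(features, threshold=0.8)
print(scenes)
```

In a pipeline like the one described above, each resulting segment would then be passed to the compressor that reduces its tokens; in practice the features would come from a visual encoder rather than hand-written vectors.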
