InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Abstract

This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling. As a result, we develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate that this unique design of LRC greatly improves the results of video MLLMs on mainstream video understanding benchmarks (short & long), enabling the MLLM to memorize significantly longer video inputs (at least 6x longer than the original) and master specialized vision capabilities like object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering MLLMs' innate abilities (focus and memory), providing new insights for future research on video MLLMs. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5
