InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

Abstract

We introduce InternVideo2, a new video foundation model (ViFM) that achievesthe state-of-the-art performance in action recognition, video-text tasks, andvideo-centric dialogue. Our approach employs a progressive training paradigmthat unifies the different self- or weakly-supervised learning frameworks ofmasked video token reconstruction, cross-modal contrastive learning, and nexttoken prediction. Different training stages would guide our model to capturedifferent levels of structure and semantic information through differentpretext tasks. At the data level, we prioritize the spatiotemporal consistencyby semantically segmenting videos and generating video-audio-speech captions.This improves the alignment between video and text. We scale both data andmodel size for our InternVideo2. Through extensive experiments, we validate ourdesigns and demonstrate the state-of-the-art performance on over 60 video andaudio tasks. Notably, our model outperforms others on various video-relatedcaptioning, dialogue, and long video understanding benchmarks, highlighting itsability to reason and comprehend long temporal contexts. Code and models areavailable at https://github.com/OpenGVLab/InternVideo2/.

Quick Read (beta)

loading the full paper ...