Learning Streaming Video Representation via Multitask Training

Abstract

Understanding continuous video streams plays a fundamental role in real-time applications, including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
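
To make the idea of causal temporal attention concrete, the following is a minimal sketch and not the StreamFormer implementation. It assumes a single patch position tracked across T frames and uses the raw features as queries, keys, and values (no learned projections, no multiple heads); the function name `causal_temporal_attention` is hypothetical. The only property it is meant to show is that the output for frame t depends on frames 0..t and never on later frames.

```python
import math


def causal_temporal_attention(frames):
    """Illustrative causal attention along the time axis.

    frames: list of T feature vectors (lists of floats) for one patch
    position across T consecutive frames.
    Returns T output vectors; output t is computed from frames 0..t only.

    Assumption: queries, keys, and values are the input features themselves.
    A real model would apply learned projections first.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for t, query in enumerate(frames):
        # Causal mask: only the current frame and earlier frames are visible.
        visible = frames[: t + 1]
        scores = [
            scale * sum(q * k for q, k in zip(query, key)) for key in visible
        ]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible frame features.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    clip = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for t, vec in enumerate(causal_temporal_attention(clip)):
        print(t, vec)
```

Because each output ignores future frames, a streaming system can in principle process one incoming frame at a time and reuse cached features from earlier frames instead of recomputing the whole clip.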
