An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes

Abstract

Large Multimodal Models (LMMs) perceive video frames uniformly, which is computationally inefficient for videos whose temporal information density varies. This paper presents \textbf{Quicksviewer}, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into cubes of varying size using Gumbel Softmax, followed by a unified resampling of each cube to achieve efficient video understanding. This simple and intuitive approach dynamically compresses video online based on its temporal density, significantly reducing spatiotemporal redundancy (an overall 45$\times$ compression rate) while enabling efficient training with a large receptive field. We train the model from a language backbone through three progressive stages, each incorporating lengthy videos averaging 420s at 1fps, thanks to the perceiving efficiency. With only 0.8M total video-text samples for training, our model outperforms the direct baseline that employs a fixed partitioning strategy by up to 8.72 in accuracy, demonstrating its effectiveness. On Video-MME, Quicksviewer achieves SOTA under modest sequence lengths using at most 5\% of the tokens per frame required by baselines. With this paradigm, scaling up the number of input frames reveals a clear power law in model capabilities. We also verify empirically that the segments generated by the cubing network can help analyze continuous events in videos.
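
To make the partitioning idea concrete, here is a minimal sketch of Gumbel-Softmax-based selection of cube boundaries. It is not the paper's cubing network. The names `gumbel_softmax`, `partition_into_cubes`, and `boundary_logits`, and the repeated hard argmax selection, are assumptions introduced only for illustration. It uses the Python standard library alone, so it shows sampling but not the gradient path a trained model would need.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample over `logits` (Gumbel-Softmax trick)."""
    noisy = []
    for x in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) with u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((x - math.log(-math.log(u))) / tau)
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def partition_into_cubes(boundary_logits, num_cubes, tau=1.0, rng=random):
    """Split frames 0..N-1 into `num_cubes` contiguous, non-empty cubes.

    boundary_logits[i] is a hypothetical score for cutting between
    frame i and frame i + 1, so it has length N - 1.
    """
    n_frames = len(boundary_logits) + 1
    if not 1 <= num_cubes <= n_frames:
        raise ValueError("num_cubes must be between 1 and the number of frames")
    logits = list(boundary_logits)
    cuts = []
    for _ in range(num_cubes - 1):
        probs = gumbel_softmax(logits, tau, rng)
        idx = max(range(len(probs)), key=probs.__getitem__)  # hard choice
        cuts.append(idx)
        logits[idx] = float("-inf")  # a cut position is used at most once
    cubes, start = [], 0
    for cut in sorted(cuts):
        cubes.append(list(range(start, cut + 1)))
        start = cut + 1
    cubes.append(list(range(start, n_frames)))
    return cubes


if __name__ == "__main__":
    # Eight frames; high scores after frames 1 and 4 suggest scene changes.
    scores = [0.1, 5.0, 0.2, 0.1, 4.0, 0.3, 0.1]
    print(partition_into_cubes(scores, num_cubes=3, rng=random.Random(0)))
```

Because the boundaries are sampled, the resulting cubes differ between runs unless the random generator is seeded. High-scoring positions are only more likely to be cut, not guaranteed. A differentiable version would typically keep the soft probabilities for the backward pass (a straight-through estimator), which this sketch omits.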
