Abstract
Large Multimodal Models (LMMs) perceive video frames uniformly, which is computationally inefficient for videos whose temporal information density varies. This paper presents \textbf{Quicksviewer}, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into cubes of varying length using Gumbel Softmax, then applies a unified resampling to each cube to achieve efficient video understanding. This simple and intuitive approach dynamically compresses video online based on its temporal density, significantly reducing spatiotemporal redundancy (an overall 45$\times$ compression rate) while enabling efficient training with a large receptive field. We train the model from a language backbone through three progressive stages, each incorporating lengthy videos averaging 420s at 1fps, thanks to the perceiving efficiency. With only 0.8M total video-text samples for training, our model outperforms the direct baseline that employs a fixed partitioning strategy by up to 8.72 in accuracy, demonstrating its effectiveness. On Video-MME, Quicksviewer achieves SOTA under modest sequence lengths using at most 5\% of the tokens per frame required by baselines. With this paradigm, scaling up the number of input frames reveals a clear power law in model capabilities. We also verify empirically that the segments generated by the cubing network can help analyze continuous events in videos.
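
To make the partitioning idea concrete, the following is a minimal sketch, not the paper's cubing network. It assumes a per-gap boundary score is already available and uses the Gumbel top-k trick (a hard, non-differentiable relative of Gumbel Softmax) to pick cut points, so that cubes come out with varying lengths. The names `gumbel_noise`, `partition_into_cubes`, `boundary_logits`, `num_cubes`, and `tau` are hypothetical and chosen only for illustration.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def partition_into_cubes(boundary_logits, num_cubes, tau=1.0, seed=0):
    """Split frames 0..T-1 into `num_cubes` contiguous cubes.

    boundary_logits[i] scores a cut between frame i and frame i + 1,
    so T frames have T - 1 candidate cut points. All names here are
    illustrative assumptions, not the paper's API.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    num_frames = len(boundary_logits) + 1
    if not 1 <= num_cubes <= num_frames:
        raise ValueError("num_cubes must be between 1 and the number of frames")
    rng = random.Random(seed)
    # Perturb temperature-scaled logits with Gumbel noise (Gumbel top-k trick).
    noisy = [logit / tau + gumbel_noise(rng) for logit in boundary_logits]
    # Keep the num_cubes - 1 highest-scoring cut points, in temporal order.
    ranked = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)
    cuts = sorted(ranked[: num_cubes - 1])
    # Convert cut points into half-open frame ranges [start, end).
    cubes, start = [], 0
    for cut in cuts:
        cubes.append((start, cut + 1))
        start = cut + 1
    cubes.append((start, num_frames))
    return cubes


# Ten frames with two strongly preferred cut points (after frames 2 and 6).
logits = [-50, -50, 50, -50, -50, -50, 50, -50, -50]
print(partition_into_cubes(logits, num_cubes=3))  # [(0, 3), (3, 7), (7, 10)]
```

In this sketch each cube would then be resampled to the same number of tokens, so a long, low-density cube and a short, high-density cube cost the same downstream. A trainable version would need a differentiable relaxation such as Gumbel Softmax with a straight-through estimator, which this example does not implement.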