TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs' context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. Existing video MLLMs adopt training-free uniform sampling or keyframe search, which may miss critical events or be constrained by the pre-trained models' event understanding capabilities. Meanwhile, building a training-based method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation to perform probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization with efficient rule-based rewards. Furthermore, to train TSPO, we propose a long video training data construction pipeline with comprehensive temporal data and video Needle-in-a-Haystack data. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks and transfers across different cutting-edge Video-MLLMs.
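
To make the ingredients named in the abstract more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It shows one plausible way to (a) sample keyframes probabilistically from per-frame relevance scores, (b) compute a rule-based reward from answer accuracy and temporal localization, and (c) normalize rewards within a group of rollouts. All function names, the Gumbel top-k sampling trick, and the unweighted sum of the two reward terms are assumptions introduced here for illustration; the abstract does not specify any of them.

```python
import math
import random
from statistics import mean, pstdev


def sample_keyframes(scores, k, rng):
    """Draw k distinct frame indices, favoring higher scores.

    Assumption: Gumbel top-k sampling stands in for the paper's
    (unspecified) probabilistic keyframe selection.
    """
    perturbed = []
    for s in scores:
        u = max(rng.random(), 1e-12)  # avoid log(0)
        perturbed.append(s - math.log(-math.log(u)))  # score + Gumbel noise
    order = sorted(range(len(scores)), key=lambda i: perturbed[i], reverse=True)
    return sorted(order[:k])  # keep temporal order


def rule_based_reward(answer, gold_answer, frames, gold_span):
    """Hypothetical reward: exact-match accuracy plus the fraction of
    sampled frames that fall inside the annotated event span."""
    accuracy = 1.0 if answer == gold_answer else 0.0
    start, end = gold_span
    located = sum(start <= f <= end for f in frames) / len(frames)
    return accuracy + located  # assumption: equal weighting


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [0.1, 0.2, 2.5, 3.0, 0.3, 0.1, 2.8, 0.2]  # toy event-query scores
    group = [sample_keyframes(scores, k=3, rng=rng) for _ in range(4)]
    rewards = [rule_based_reward("B", "B", frames, (2, 6)) for frames in group]
    print(group)
    print(group_relative_advantages(rewards))
```

In this sketch, rollouts whose sampled frames land inside the annotated span receive above-average advantages, so a policy-gradient update would raise the probability of selecting those frames. When every rollout in a group earns the same reward, all advantages are zero and the group contributes no learning signal.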
