Abstract
Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks (LongVideoBench, MLVU, and Video-MME) demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding. Project page: https://ruili33.github.io/tpo_website.