Abstract
Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for eliciting video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We construct two datasets, both comprising image and video data: Video-R1-COT-165k for the SFT cold start and Video-R1-260k for RL training. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass. Notably, Video-R1-7B attains 35.8% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released.