ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis

Abstract

While Reinforcement Learning with Verifiable Reward (RLVR) significantly advances image reasoning in Large Vision-Language Models (LVLMs), its application to complex video reasoning remains underdeveloped. This gap stems primarily from a critical data bottleneck: existing datasets lack the challenging, multi-hop questions and high-quality, video-grounded Chain-of-Thought (CoT) data necessary to effectively bootstrap RLVR. To address this, we introduce ReWatch, a large-scale dataset built to foster advanced video reasoning. We propose a novel multi-stage synthesis pipeline to synthesize its three components: ReWatch-Caption, ReWatch-QA, and ReWatch-CoT. A core innovation is our Multi-Agent ReAct framework for CoT synthesis, which simulates a human-like "re-watching" process to generate video-grounded reasoning traces by explicitly modeling information retrieval and verification. Building on this dataset, we develop ReWatch-R1 by post-training a strong baseline LVLM with Supervised Fine-Tuning (SFT) and our RLVR framework. This framework incorporates a novel Observation & Reasoning (O&R) reward mechanism that evaluates both the final answer's correctness and the reasoning's alignment with video content, directly penalizing hallucination. Our experiments show that ReWatch-R1 achieves state-of-the-art average performance on five challenging video reasoning benchmarks. Project Page: https://rewatch-r1.github.io
