Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

Abstract

Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by 2.4% in accuracy using only 3.6% of the training samples. Specifically, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
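
The sparse-to-dense TTS idea described in the abstract can be illustrated with a minimal sketch. The abstract only states that frames are added iteratively based on output consistency, so everything else here is an assumption made for illustration: the function name `sparse_to_dense_tts`, the hypothetical `answer_fn` callback (standing in for one sampled model answer at a given frame count), the unanimity test for consistency, the frame-doubling schedule, and the majority-vote fallback at the frame budget.

```python
from collections import Counter
from typing import Callable, Tuple


def sparse_to_dense_tts(
    answer_fn: Callable[[int], str],
    start_frames: int = 32,
    max_frames: int = 128,
    num_samples: int = 5,
) -> Tuple[str, int]:
    """Return (answer, frames_used) for one video question.

    answer_fn(num_frames) is a hypothetical stand-in for sampling one
    answer from a video reasoning model given `num_frames` frames.
    """
    if start_frames < 1 or num_samples < 1 or max_frames < start_frames:
        raise ValueError("invalid frame or sample settings")

    num_frames = start_frames
    while True:
        # Sample several answers at the current frame density.
        answers = [answer_fn(num_frames) for _ in range(num_samples)]
        top_answer, top_count = Counter(answers).most_common(1)[0]

        # Assumption: "consistent" means all samples agree.
        # At the frame budget, fall back to the most common answer.
        if top_count == num_samples or num_frames >= max_frames:
            return top_answer, num_frames

        # Assumption: densify by doubling the frame count, capped at the budget.
        num_frames = min(num_frames * 2, max_frames)
```

Under these assumptions, easy questions stop at the sparse setting, while questions with inconsistent outputs receive more frames, so the extra compute is spent only where the sampled answers disagree.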
