Scaling RL to Long Videos

Abstract

We introduce a full-stack framework that uses reinforcement learning to scale reasoning in vision-language models (VLMs) to long videos. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.0% and 70.7% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-R1 across multiple benchmarks. Moreover, LongVILA-R1 shows steady performance improvements as the number of input video frames increases. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).
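
To give a feel for the scale of the hour-long example, the figures above can be turned into a rough token budget. This is only a back-of-the-envelope sketch, not part of the released system: it assumes "256k" means 256 x 1024 tokens and that sequence parallelism splits the sequence evenly across the 8 GPUs, neither of which the abstract states. The helper name `token_budget` is hypothetical.

```python
# Rough token budget for the hour-long-video example in the abstract.
# Assumptions (not stated in the abstract): "256k" is read as 256 * 1024
# tokens, and the sequence is sharded evenly across the node's 8 GPUs.

def token_budget(num_frames: int, total_tokens: int, num_gpus: int) -> dict:
    """Return approximate tokens per frame and tokens per GPU shard."""
    if num_frames <= 0 or num_gpus <= 0:
        raise ValueError("num_frames and num_gpus must be positive")
    return {
        "tokens_per_frame": total_tokens / num_frames,
        "tokens_per_gpu": total_tokens // num_gpus,
    }


budget = token_budget(num_frames=3600, total_tokens=256 * 1024, num_gpus=8)
print(f"{budget['tokens_per_frame']:.1f} tokens per frame")
print(f"{budget['tokens_per_gpu']} tokens per GPU")
```

Under these assumptions, each frame contributes on the order of 70 tokens and each GPU holds a shard of roughly 32k tokens. The real per-frame count depends on the vision encoder and on how many text tokens share the context.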
