VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Abstract

Recently, DeepSeek R1 has shown that reinforcement learning (RL) can substantially improve the reasoning capabilities of Large Language Models (LLMs) through a simple yet effective design. The core of R1 lies in its rule-based reward formulation, which leverages tasks with deterministic ground-truth answers to enable precise and stable reward computation. In the visual domain, we similarly observe that a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations. This property makes them naturally compatible with rule-based reward mechanisms. Motivated by this observation, we investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs), aiming to enhance their visual reasoning capabilities. To this end, we develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks. Using this framework, we further explore the feasibility of applying RL to the visual domain. Experimental results indicate that the RL-based model not only delivers competitive performance on visual understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in generalization ability. Furthermore, we conduct comprehensive ablation studies that uncover a series of noteworthy insights, including the presence of reward hacking in object detection, the emergence of the "OD aha moment", the impact of training data quality, and the scaling behavior of RL across different model sizes. Through these analyses, we aim to deepen the understanding of how reinforcement learning enhances the capabilities of vision-language models, and we hope our findings and open-source contributions will support continued progress in the vision-language RL community. Our code and model are available at https://github.com/om-ai-lab/VLM-R1
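
To make the notion of a rule-based reward on a visual task concrete, the following is a minimal illustrative sketch, not the reward used in VLM-R1. It assumes a grounding-style task in which the model's text response contains one bounding box written as `[x1, y1, x2, y2]`, and it scores the response by thresholding the intersection-over-union (IoU) with the ground-truth box. The function names, the output format, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import re
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention

# Hypothetical output format: a bracketed list of four numbers.
_NUM = r"(-?\d+(?:\.\d+)?)"
_BOX_PATTERN = re.compile(r"\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]")


def parse_box(text: str) -> Optional[Box]:
    """Return the first [x1, y1, x2, y2] box found in the text, or None."""
    match = _BOX_PATTERN.search(text)
    if match is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in match.groups())
    return (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rule_based_reward(response: str, gt_box: Box, threshold: float = 0.5) -> float:
    """Deterministic reward: 1.0 if a parsable box overlaps the ground truth
    with IoU >= threshold (an assumed value), otherwise 0.0."""
    box = parse_box(response)
    if box is None:
        return 0.0  # unparsable output earns no reward
    return 1.0 if iou(box, gt_box) >= threshold else 0.0
```

Because the reward depends only on the parsed output and the annotation, it is deterministic and needs no learned reward model, which is the property the abstract attributes to tasks with well-defined ground truth.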
