Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback

  • 2025-07-28 12:21:19
  • Yang Chen, Yufan Shen, Wenxuan Huang, Shen Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Botian Shi, Yu Qiao

Abstract

Multimodal Large Language Models (MLLMs) have exhibited impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To solve this problem, we introduce a novel framework termed "Reasoning-Rendering-Visual-Feedback" (RRVF), which enables MLLMs to learn complex visual reasoning from only raw images. This framework builds on the "Asymmetry of Verification" principle to train MLLMs, i.e., verifying the rendered output against a source image is easier than generating it. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL) training, reducing the reliance on the image-text supervision. Guided by the above principle, RRVF implements a closed-loop iterative process encompassing reasoning, rendering, and visual feedback components, enabling the model to perform self-correction through multi-turn interactions and tool invocation, while this pipeline can be optimized by the GRPO algorithm in an end-to-end manner. Extensive experiments on image-to-code generation for data charts and web interfaces show that RRVF substantially outperforms existing open-source MLLMs and surpasses supervised fine-tuning baselines. Our findings demonstrate that systems driven by purely visual feedback present a viable path toward more robust and generalizable reasoning models without requiring explicit supervision. Code will be available at https://github.com/L-O-I/RRVF.
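
As a rough illustration of how a visual-verification reward could feed GRPO-style optimization, the following minimal sketch scores several candidate renderings against a target image and converts the scores into group-relative advantages. It is not the RRVF implementation: the abstract does not specify the reward, so `pixel_similarity` is a hypothetical stand-in (mean absolute pixel difference on tiny grayscale grids), and the use of the population standard deviation with a small `eps` is an assumption. Only the general idea of normalizing each reward against its sampling group reflects GRPO.

```python
from statistics import mean, pstdev


def pixel_similarity(rendered, target):
    """Hypothetical visual reward: 1 minus the mean absolute pixel difference.

    Both images are equal-size grayscale grids with values in [0, 1].
    """
    diffs = [
        abs(a - b)
        for row_r, row_t in zip(rendered, target)
        for a, b in zip(row_r, row_t)
    ]
    return 1.0 - mean(diffs)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style advantage).

    Assumption: population standard deviation, with eps guarding against
    a zero denominator when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Target image and three candidate renderings (stand-ins for rendered code outputs).
target = [[0.0, 1.0], [1.0, 0.0]]
candidates = [
    [[0.0, 1.0], [1.0, 0.0]],  # matches the target exactly
    [[0.0, 1.0], [1.0, 1.0]],  # one pixel differs
    [[1.0, 0.0], [0.0, 1.0]],  # every pixel differs
]

rewards = [pixel_similarity(c, target) for c in candidates]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Candidates that render closer to the target receive positive advantages and the rest negative ones, so the policy update needs only the source image and a renderer, with no paired text labels.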

 
