Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback

Abstract

Multimodal Large Language Models (MLLMs) have exhibited impressive performance across various visual tasks, and subsequent work on enhancing their visual reasoning abilities has significantly expanded their performance envelope. However, a critical bottleneck in advancing MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To address this problem, we introduce a novel framework termed "Reasoning-Rendering-Visual-Feedback" (RRVF), which enables MLLMs to learn complex visual reasoning from raw images alone. The framework builds on the "Asymmetry of Verification" principle: verifying a rendered output against a source image is easier than generating that output. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL), reducing the reliance on image-text supervision. Guided by this principle, RRVF implements a closed-loop iterative process comprising reasoning, rendering, and visual feedback components, which enables the model to self-correct through multi-turn interactions and tool invocation. The entire pipeline is optimized end to end with the GRPO algorithm. Extensive experiments on image-to-code generation for data charts and web interfaces show that RRVF substantially outperforms existing open-source MLLMs and surpasses supervised fine-tuning baselines. Our findings demonstrate that systems driven by purely visual feedback present a viable path toward more robust and generalizable reasoning models without requiring explicit supervision. Code will be available at https://github.com/L-O-I/RRVF.
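
The closed loop described above can be illustrated with a minimal sketch. It is not the released RRVF implementation: the pixel-difference reward and the `propose`, `render`, and `refine` names are assumptions introduced for illustration, and the abstract does not specify how the visual feedback is scored. Only the group-relative normalization of rewards follows the standard GRPO formulation.

```python
from statistics import mean, pstdev
from typing import Any, Callable, List, Optional, Sequence, Tuple

# Assumption: an image is a grid of grayscale pixels in [0, 1].
Image = Sequence[Sequence[float]]


def visual_reward(rendered: Image, target: Image) -> float:
    """Hypothetical reward: 1 minus the mean absolute pixel difference."""
    diffs = [
        abs(a - b)
        for row_r, row_t in zip(rendered, target)
        for a, b in zip(row_r, row_t)
    ]
    return 1.0 - mean(diffs)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def refine(
    target: Image,
    propose: Callable[[Image, Optional[float]], Any],  # hypothetical model call
    render: Callable[[Any], Image],                    # hypothetical rendering tool
    max_turns: int = 3,
    threshold: float = 0.99,
) -> Tuple[Any, float]:
    """Reason -> render -> compare, repeated until the render matches the target."""
    feedback: Optional[float] = None
    best_code, best_reward = None, float("-inf")
    for _ in range(max_turns):
        code = propose(target, feedback)                # reasoning step
        reward = visual_reward(render(code), target)    # rendering + verification
        if reward > best_reward:
            best_code, best_reward = code, reward
        if reward >= threshold:
            break
        feedback = reward                               # visual feedback for next turn
    return best_code, best_reward
```

In this sketch, verification is a cheap image comparison while generation requires producing code that renders correctly, which is the asymmetry the framework relies on. The rewards of several sampled trajectories for the same image would then be passed to `group_relative_advantages` before a policy update.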
