Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback

Abstract

Multimodal Large Language Models (MLLMs) exhibit impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To solve this problem, we introduce a novel framework termed "Reasoning-Rendering-Visual-Feedback" (RRVF), which enables MLLMs to learn complex visual reasoning from only raw images. This framework builds on the "Asymmetry of Verification" principle to train MLLMs, i.e., verifying the rendered output against a source image is easier than generating it. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL) training, reducing reliance on image-text supervision. Guided by this principle, RRVF implements a closed-loop iterative process encompassing reasoning, rendering, and visual feedback components, enabling the model to perform self-correction through multi-turn interactions. The whole pipeline can be optimized end-to-end by the GRPO algorithm. Extensive evaluations are conducted on image-to-code generation across two diverse domains: data charts and web interfaces. The RRVF-trained model not only outperforms existing open-source MLLMs and supervised fine-tuning baselines but also exhibits superior generalization to unseen datasets. Critically, the model's performance surpasses that of the more advanced MLLM used to provide the feedback signal during training. This work establishes a self-improvement paradigm that offers a viable path to robust, generalizable models without reliance on explicit supervision. Code will be available at https://github.com/L-O-I/RRVF.
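
To make the reward idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each candidate program for a source image has already been rendered, scores each rendering against the source, and then converts the scores into group-relative advantages in the style of GRPO (reward minus group mean, divided by group standard deviation). The names `pixel_similarity` and `group_relative_advantages` are hypothetical, and the pixel-difference score is only a toy stand-in for the visual feedback described in the abstract.

```python
from statistics import mean, pstdev


def pixel_similarity(source, rendered):
    """Toy verification score (an assumption, not the paper's reward).

    Both images are flat lists of pixel intensities in [0, 1].
    Returns 1.0 for identical images and 0.0 for maximally different ones.
    """
    if len(source) != len(rendered):
        raise ValueError("images must have the same number of pixels")
    diff = sum(abs(s - r) for s, r in zip(source, rendered)) / len(source)
    return 1.0 - diff


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same image.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    are ranked against their siblings rather than against a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    source = [0.0, 0.5, 1.0, 0.5]
    # Hypothetical renderings of three candidate programs for the same image.
    renderings = [
        [0.0, 0.5, 1.0, 0.5],  # exact match
        [0.1, 0.5, 0.9, 0.5],  # close
        [1.0, 0.0, 0.0, 1.0],  # poor
    ]
    rewards = [pixel_similarity(source, r) for r in renderings]
    for reward, adv in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.3f} advantage={adv:+.3f}")
```

Checking a rendering against the source needs only the image itself, which is the asymmetry the abstract relies on: no paired text label enters the reward computation.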
