Abstract
Multimodal Large Language Models (MLLMs) exhibit impressive performance across various visual tasks, and subsequent work on enhancing their visual reasoning abilities has significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To address this problem, we introduce a novel framework termed ``Reasoning-Rendering-Visual-Feedback'' (RRVF), which enables MLLMs to learn complex visual reasoning from raw images alone. The framework builds on the ``Asymmetry of Verification'' principle, i.e., verifying the rendered output against a source image is easier than generating it. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL), reducing reliance on image-text supervision. Guided by this principle, RRVF implements a closed-loop iterative process encompassing reasoning, rendering, and visual feedback components, which enables the model to perform self-correction through multi-turn interactions; the entire pipeline can be optimized end-to-end with the GRPO algorithm. We conduct extensive evaluations on image-to-code generation across two diverse domains: data charts and web interfaces. The RRVF-trained model not only outperforms existing open-source MLLMs and supervised fine-tuning baselines but also exhibits superior generalization to unseen datasets. Critically, the model's performance surpasses that of the more advanced MLLM used to provide the feedback signal during training. This work establishes a self-improvement paradigm that offers a viable path to robust, generalizable models without reliance on explicit supervision. Code will be available at https://github.com/L-O-I/RRVF.
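To make the closed loop concrete, the following is a minimal, illustrative sketch and not the released RRVF implementation. It assumes a toy setting in which an "image" is a list of floats and the verifier is a simple distance-based score, whereas the abstract describes feedback coming from an MLLM. All names (`visual_similarity`, `rollout`, `group_relative_advantages`) are hypothetical. The advantage computation follows the commonly described GRPO recipe of normalizing each reward against the mean and standard deviation of its sampled group; the exact reward design used in RRVF is not specified in the abstract.

```python
from statistics import mean, pstdev
from typing import Callable, List, Optional

Image = List[float]  # toy stand-in for a rendered image (assumption)


def visual_similarity(rendered: Image, target: Image) -> float:
    """Toy verifier: 1.0 for identical images, approaching 0 as they diverge."""
    err = mean(abs(r - t) for r, t in zip(rendered, target))
    return 1.0 / (1.0 + err)


def rollout(
    propose: Callable[[Optional[float]], Image],
    render: Callable[[Image], Image],
    target: Image,
    max_turns: int = 3,
) -> float:
    """Multi-turn loop: propose code, render it, feed the score back.

    Returns the score of the final turn, used here as the episode reward.
    """
    feedback: Optional[float] = None
    score = 0.0
    for _ in range(max_turns):
        code = propose(feedback)           # reasoning step (hypothetical policy)
        score = visual_similarity(render(code), target)  # rendering + verification
        feedback = score                   # visual feedback for the next turn
    return score


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, checking a candidate only requires rendering it and comparing it with the source image, which is the cheap side of the verification asymmetry; the resulting scalar rewards for a group of rollouts are then turned into advantages without any image-text labels.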