VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

Abstract

Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chains of thought with tools.
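
As an illustration of what an outcome-based reward tied to task accuracy could look like, the following is a minimal sketch, not the reward used in VTool-R1. The function names `normalize` and `outcome_reward` are hypothetical, and scoring by normalized exact string match is an assumption; the abstract states only that the reward depends on the correctness of the final answer and not on intermediate steps.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse internal whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", answer.strip().lower())


def outcome_reward(predicted: str, gold: str) -> float:
    """Hypothetical outcome-based reward: 1.0 for a correct final answer, else 0.0.

    Only the final answer is scored. Intermediate text or visual steps,
    including any tool calls, receive no direct supervision.
    """
    return 1.0 if normalize(predicted) == normalize(gold) else 0.0
```

Under this kind of reward, a rollout that edits the image with a tool and a rollout that does not are scored identically whenever their final answers agree, so tool use is reinforced only to the extent that it leads to more correct answers.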
