Abstract
Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chains of thought with tools.