VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

  • 2025-06-11 22:47:49
  • Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt

Abstract

Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chains of thought with tools.
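
To illustrate what "outcome-based rewards tied to task accuracy" can mean in practice, the following is a minimal sketch and not the paper's implementation. The abstract does not specify the reward function, so the exact-match rule, the `normalize` helper, and the rollout dictionary layout are all assumptions made for illustration. The point it shows is that the reward depends only on the final answer, not on whether or how a visual tool was used along the way.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.strip().lower().split())


def outcome_reward(predicted: str, reference: str) -> float:
    """Hypothetical outcome-based reward: 1.0 for a correct final answer, else 0.0.

    Only the final answer is inspected. Intermediate text or visual steps
    are not scored, i.e. there is no process-based supervision.
    """
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


# Hypothetical rollouts for one question whose reference answer is "42".
# The "used_tool" field is illustrative and deliberately ignored by the reward.
rollouts = [
    {"used_tool": True, "final_answer": "42"},
    {"used_tool": False, "final_answer": "41"},
    {"used_tool": True, "final_answer": " 42 "},
]
rewards = [outcome_reward(r["final_answer"], "42") for r in rollouts]
print(rewards)  # [1.0, 0.0, 1.0]
```

Under a reward of this shape, tool use is reinforced only indirectly: rollouts that call a visual tool are favored only to the extent that they end in correct answers more often.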

 
