Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

Abstract

Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a prominent recent method that encourages models to generate complete reasoning traces before answering, which increases token usage and computational cost. Inspired by the human thinking process, in which people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts, introducing a think-or-not format that serves as a cold start for selective reasoning; and (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO without sacrificing performance, and in some cases while improving it. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties with both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches. Our code is available at https://github.com/kokolerk/TON.
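
The 'thought dropout' operation described above can be illustrated with a minimal sketch. The abstract does not specify the data format, so the `<think>`/`<answer>` tags, the empty-thought placeholder, the function names, and the default drop probability below are all illustrative assumptions rather than the authors' implementation.

```python
import random
import re

# ASSUMPTION: completions look like "<think>...</think><answer>...</answer>".
# The tag names and the empty-thought placeholder are hypothetical; the
# abstract only says reasoning traces are replaced with "empty thoughts".
THINK_PATTERN = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
EMPTY_THOUGHT = "<think>\n\n</think>"


def thought_dropout(completion: str, drop_prob: float, rng: random.Random) -> str:
    """With probability drop_prob, replace the reasoning trace with an empty thought."""
    if rng.random() < drop_prob:
        # A callable replacement avoids any escape processing of the placeholder.
        return THINK_PATTERN.sub(lambda _: EMPTY_THOUGHT, completion, count=1)
    return completion


def apply_thought_dropout(completions, drop_prob=0.5, seed=0):
    """Apply thought dropout independently to each SFT target (drop_prob is an assumed value)."""
    rng = random.Random(seed)
    return [thought_dropout(c, drop_prob, rng) for c in completions]
```

Under these assumptions, the SFT data would contain both full-reasoning and empty-thought targets, so the model sees both formats before the GRPO stage explores when to use each.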
