Abstract
Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods improve the reasoning of Vision-Language Models (VLMs) via Chain-of-Thought (CoT) supervised fine-tuning, using meticulously annotated training data to enhance visual reasoning capabilities. However, this training paradigm may lead to overfitting and cognitive rigidity, restricting the model's ability to transfer visual reasoning skills across domains and limiting its real-world applicability. To address these limitations, we propose Reason-RFT, a novel reinforcement fine-tuning framework that significantly enhances generalization capabilities in visual reasoning tasks. Reason-RFT introduces a two-phase training framework for visual reasoning: (1) Supervised Fine-Tuning (SFT) with curated CoT data activates the reasoning potential of VLMs, followed by (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning that generates multiple reasoning-response pairs, significantly enhancing generalization in visual reasoning tasks. To evaluate Reason-RFT's visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation. Experimental results demonstrate Reason-RFT's three key advantages: (1) Performance Enhancement: achieving state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models; (2) Generalization Superiority: consistently maintaining robust performance across diverse tasks and domains, outperforming alternative training paradigms; (3) Data Efficiency: excelling in few-shot learning scenarios while surpassing full-dataset SFT baselines. Project website: https://tanhuajie.github.io/ReasonRFT