Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task requiring identification of object transformations across images under varying viewpoints. While traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise reasoning. Comprehensive evaluations demonstrate that STAR-R1 achieves state-of-the-art performance across all 11 metrics, outperforming SFT by 23% in cross-view scenarios. Further analysis reveals STAR-R1's anthropomorphic behavior and highlights its unique ability to compare all objects for improving spatial reasoning. Our work provides critical insights for advancing research on MLLMs and reasoning models. The code, model weights, and data will be publicly available at https://github.com/zongzhao23/STAR-R1.
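To make the idea of a fine-grained reward concrete, the following is a minimal illustrative sketch, not the STAR-R1 implementation. It assumes each transformation can be written as an (object, attribute, value) triple; the `Transformation` type, the `fine_grained_reward` function, and all numeric weights are hypothetical choices made for this example. It gives full credit for an exact triple, partial credit when the object and attribute are right but the value is wrong, a penalty for each prediction beyond the true number of transformations (excessive enumeration), and a penalty for an empty answer (passive inaction).

```python
from typing import List, NamedTuple


class Transformation(NamedTuple):
    """Hypothetical representation of one object transformation."""
    obj: int        # index of the object that changed
    attribute: str  # e.g. "color", "size"
    value: str      # new value of the attribute


def fine_grained_reward(
    predicted: List[Transformation],
    ground_truth: List[Transformation],
    full: float = 1.0,           # assumed credit for an exact triple
    partial: float = 0.5,        # assumed credit for right object and attribute
    extra_penalty: float = 0.5,  # assumed cost per surplus prediction
    empty_penalty: float = 1.0,  # assumed cost of answering nothing
) -> float:
    # Passive inaction: an empty answer is penalized outright.
    if not predicted:
        return -empty_penalty

    remaining = list(ground_truth)
    unmatched = []
    reward = 0.0

    # Pass 1: exact matches, each ground-truth triple used at most once.
    for p in predicted:
        if p in remaining:
            reward += full
            remaining.remove(p)
        else:
            unmatched.append(p)

    # Pass 2: partial credit for the right object and attribute, wrong value.
    for p in unmatched:
        match = next(
            (g for g in remaining
             if g.obj == p.obj and g.attribute == p.attribute),
            None,
        )
        if match is not None:
            reward += partial
            remaining.remove(match)

    # Excessive enumeration: every prediction beyond the true count costs reward.
    surplus = max(0, len(predicted) - len(ground_truth))
    return reward - extra_penalty * surplus
```

Under these assumed weights, a policy that lists extra candidate transformations scores lower than one that outputs exactly the correct set, while a partly correct answer still scores higher than an empty one. That ordering is what gives a denser learning signal than a sparse all-or-nothing reward.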