Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning.