SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning

Abstract

Despite recent advances in multi-modal models, 3D spatial reasoning remains a challenging task for state-of-the-art open-source and proprietary models. Recent studies explore data-driven approaches and achieve enhanced spatial reasoning performance by fine-tuning models on 3D-related visual question-answering data. However, these methods typically perform spatial reasoning in an implicit manner and often fail on questions that are trivial to humans, even with long chain-of-thought reasoning. In this work, we introduce SpatialReasoner, a novel large vision-language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared between multiple stages: 3D perception, computation, and reasoning. Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and improves generalization to novel question types. Furthermore, by analyzing the explicit 3D representations in the multi-step reasoning traces of SpatialReasoner, we study the factual errors and identify key shortcomings of current LVLMs. Results show that SpatialReasoner achieves improved performance on a variety of spatial reasoning benchmarks, outperforming Gemini 2.0 by 9.2% on 3DSRBench, and generalizes better when evaluated on novel 3D spatial reasoning questions. Our study bridges the 3D parsing capabilities of prior visual foundation models with the powerful reasoning abilities of large language models, opening new directions for 3D spatial reasoning.
