Abstract
Recent studies in 3D spatial reasoning explore data-driven approaches and achieve enhanced spatial reasoning performance with reinforcement learning (RL). However, these methods typically perform spatial reasoning in an implicit manner, and it remains underexplored whether the acquired 3D knowledge generalizes to unseen question types at any stage of the training. In this work, we introduce SpatialReasoner, a novel large vision-language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared between stages: 3D perception, computation, and reasoning. Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and enable us to study the factual errors made by LVLMs. Results show that SpatialReasoner achieves improved performance on a variety of spatial reasoning benchmarks and generalizes better when evaluated on novel 3D spatial reasoning questions. Our study bridges the 3D parsing capabilities of prior visual foundation models with the powerful reasoning abilities of large language models, opening new directions for 3D spatial reasoning.