SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models

Abstract

Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack this capability of 3D spatial reasoning. This limitation stems from the scarcity of 3D training data and the bias in current model designs toward 2D data. In this paper, we systematically study the impact of 3D-informed data, architecture, and training setups, introducing SpatialLLM, a large multi-modal model with advanced 3D spatial reasoning abilities. To address data limitations, we develop two types of 3D-informed training datasets: (1) 3D-informed probing data focused on objects' 3D location and orientation, and (2) 3D-informed conversation data for complex spatial relationships. Notably, we are the first to curate VQA data that incorporate 3D orientation relationships on real images. Furthermore, we systematically integrate these two types of training data with the architectural and training designs of LMMs, providing a roadmap for optimal design aimed at achieving superior 3D reasoning capabilities. Our SpatialLLM advances machines toward highly capable 3D-informed reasoning, surpassing GPT-4o performance by 8.7%. Our systematic empirical design and the resulting findings offer valuable insights for future research in this direction. Our project page is available at: https://3d-spatial-reasoning.github.io/spatial-llm/
