Abstract
Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding, but they still struggle with spatial understanding, which is the foundation of Embodied AI. In this paper, we propose SpatialBot, which improves spatial understanding by taking both RGB and depth images as input. Additionally, we construct the SpatialQA dataset, which contains multi-level depth-related questions for training VLMs in depth understanding. Finally, we present SpatialBench to comprehensively evaluate VLMs' spatial understanding capabilities at different levels. Extensive experiments on our spatial-understanding benchmark, general VLM benchmarks, and Embodied AI tasks show that SpatialBot trained on SpatialQA achieves remarkable improvements. The model, code, and data are available at https://github.com/BAAI-DCAI/SpatialBot.