Abstract
A new era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks. However, most existing 3D multimodal LLMs (MLLMs) rely on compressing holistic 3D scene information or segmenting independent objects to perform these tasks, which limits their spatial awareness because such representations fail to capture the richness inherent in 3D scenes. To overcome these limitations, we propose Spatial 3D-LLM, a 3D MLLM specifically designed to enhance spatial awareness for 3D vision-language tasks by enriching the spatial embeddings of 3D scenes. Spatial 3D-LLM integrates an LLM backbone with a progressive spatial awareness scheme that captures spatial information as the perception field expands, generating location-enriched 3D scene embeddings that serve as visual prompts. Furthermore, we introduce two novel tasks, 3D object distance measurement and 3D layout editing, and construct a 3D instruction dataset, MODEL, to evaluate the model's spatial awareness capabilities. Experimental results demonstrate that Spatial 3D-LLM achieves state-of-the-art performance across a wide range of 3D vision-language tasks, indicating that the improvements stem from our progressive spatial awareness scheme, which mines deeper spatial information. Our code is available at https://github.com/bjshuyuan/Spatial-3D-LLM.