Abstract
3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io.