Abstract
Remarkable progress in 2D Vision-Language Models (VLMs) has spurred interest in extending them to 3D settings for tasks like 3D Question Answering, Dense Captioning, and Visual Grounding. Whereas 2D VLMs typically process images through an image encoder, 3D scenes, with their intricate spatial structures, allow for diverse model architectures. Based on their encoder design, this paper categorizes recent 3D VLMs into 3D object-centric, 2D image-based, and 3D scene-centric approaches. Despite the architectural similarity of 3D scene-centric VLMs to their 2D counterparts, they have exhibited lower performance than the latest 3D object-centric and 2D image-based approaches. To understand this gap, we conduct an in-depth analysis, revealing that 3D scene-centric VLMs show limited reliance on the 3D scene encoder, and that the pre-training stage appears less effective than in 2D VLMs. Furthermore, we observe that data scaling benefits are less pronounced on larger datasets. Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions, thereby diminishing the effective utilization of the 3D encoder. To address these limitations and encourage genuine 3D scene understanding, we introduce a novel 3D Relevance Discrimination QA dataset designed to disrupt shortcut learning and improve 3D understanding. Our findings highlight the need for advanced evaluation and improved strategies for better 3D understanding in 3D VLMs.