GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models

Abstract

In recent years, 2D Vision-Language Models (VLMs) have made significant strides in image-text understanding tasks. However, their performance in 3D spatial comprehension, which is critical for embodied intelligence, remains limited. Recent advances have leveraged 3D point clouds and multi-view images as inputs, yielding promising results. However, we propose exploring a purely vision-based solution inspired by human perception, which relies solely on visual cues for 3D spatial understanding. This paper empirically investigates the limitations of VLMs in 3D spatial knowledge, revealing that their primary shortcoming lies in the lack of global-local correspondence between the scene and individual frames. To address this, we introduce GPT4Scene, a novel visual prompting paradigm in VLM training and inference that helps build the global-local relationship, significantly improving the 3D spatial understanding of indoor scenes. Specifically, GPT4Scene constructs a 3D Bird's Eye View (BEV) image from the video and marks consistent object IDs across both the frames and the BEV image. The model then takes the concatenated BEV image and marked video frames as input. In zero-shot evaluations, GPT4Scene improves performance over closed-source VLMs like GPT-4o. Additionally, we prepare a processed video dataset consisting of 165K text annotations to fine-tune open-source VLMs, achieving state-of-the-art performance on all 3D understanding tasks. Surprisingly, after training with the GPT4Scene paradigm, VLMs consistently improve during inference, even without visual prompting and the BEV image as explicit correspondence. This demonstrates that the proposed paradigm helps VLMs develop an intrinsic ability to understand 3D scenes, which paves the way for a noninvasive approach to extending pre-trained VLMs for 3D scene understanding.
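
The idea of marking consistent object IDs across the BEV image and the video frames can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not describe how the scene is reconstructed or how objects are detected, so the sketch assumes that 3D object centers (with z pointing up in world coordinates) and per-frame pinhole camera parameters are already available. The names `Camera`, `project_to_frame`, `project_to_bev`, and `build_markers` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Camera:
    # Hypothetical pinhole camera (an assumption, not from the paper):
    # R is a 3x3 world-to-camera rotation, t a 3-vector translation,
    # fx/fy/cx/cy are intrinsics in pixels.
    R: list
    t: list
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int


def project_to_frame(point, cam):
    """Return (u, v) pixel coordinates of a world point, or None if not visible."""
    # World -> camera coordinates.
    xc, yc, zc = (
        sum(cam.R[i][j] * point[j] for j in range(3)) + cam.t[i]
        for i in range(3)
    )
    if zc <= 0:  # behind the camera
        return None
    u = cam.fx * xc / zc + cam.cx
    v = cam.fy * yc / zc + cam.cy
    if 0 <= u < cam.width and 0 <= v < cam.height:
        return (u, v)
    return None  # outside the image bounds


def project_to_bev(point, x_range, y_range, size):
    """Orthographic top-down projection: drop z, map x/y onto a size x size grid."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    u = (point[0] - x_min) / (x_max - x_min) * (size - 1)
    v = (y_max - point[1]) / (y_max - y_min) * (size - 1)  # image v grows downward
    return (u, v)


def build_markers(objects, cameras, x_range, y_range, bev_size):
    """objects maps object_id -> (x, y, z).

    Returns (bev_marks, frame_marks): the same object_id keys a marker
    position in the BEV image and in every frame where the object is visible.
    """
    bev_marks = {
        oid: project_to_bev(p, x_range, y_range, bev_size)
        for oid, p in objects.items()
    }
    frame_marks = []
    for cam in cameras:
        marks = {}
        for oid, p in objects.items():
            uv = project_to_frame(p, cam)
            if uv is not None:
                marks[oid] = uv
        frame_marks.append(marks)
    return bev_marks, frame_marks
```

Because the BEV markers and the frame markers are keyed by the same object ID, a marker drawn with ID 3 in a frame refers to the same object as the marker with ID 3 in the BEV image, which is the global-local correspondence the abstract describes.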
