Towards Foundation Models for 3D Vision: How Close Are We?

Abstract

Building a foundation model for 3D vision is a complex challenge that remains unsolved. Towards that goal, it is important to understand the 3D reasoning capabilities of current models as well as identify the gaps between these models and humans. Therefore, we construct a new 3D visual understanding benchmark that covers fundamental 3D vision tasks in the Visual Question Answering (VQA) format. We evaluate state-of-the-art Vision-Language Models (VLMs), specialized models, and human subjects on it. Our results show that VLMs generally perform poorly, while the specialized models are accurate but not robust, failing under geometric perturbations. In contrast, human vision continues to be the most reliable 3D visual system. We further demonstrate that neural networks align more closely with human 3D vision mechanisms compared to classical computer vision methods, and Transformer-based networks such as ViT align more closely with human 3D vision mechanisms than CNNs. We hope our study will benefit the future development of foundation models for 3D vision.
