Abstract
Building a foundation model for 3D vision is a complex challenge that remains unsolved. Towards that goal, it is important to understand the 3D reasoning capabilities of current models as well as identify the gaps between these models and humans. Therefore, we construct a new 3D visual understanding benchmark named UniQA-3D. UniQA-3D covers fundamental 3D vision tasks in the Visual Question Answering (VQA) format. We evaluate state-of-the-art Vision-Language Models (VLMs), specialized models, and human subjects on it. Our results show that VLMs generally perform poorly, while the specialized models are accurate but not robust, failing under geometric perturbations. In contrast, human vision continues to be the most reliable 3D visual system. We further demonstrate that neural networks align more closely with human 3D vision mechanisms than classical computer vision methods do, and that Transformer-based networks such as ViT align more closely with human 3D vision mechanisms than CNNs. We hope our study will benefit the future development of foundation models for 3D vision. Code is available at https://github.com/princeton-vl/UniQA-3D.