Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs

Abstract

Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge for Multi-Modal Large Language Models (MLLMs) that are to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we propose All-Angles Bench, a benchmark of over 2,100 carefully human-annotated multi-view question-answer pairs across 90 diverse real-world scenes. Our six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) specifically test a model's geometric correspondence and its capacity to align information consistently across views. Our extensive experiments, which benchmark 27 representative MLLMs, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o, against human evaluators, reveal a substantial performance gap, indicating that current MLLMs remain far from human-level proficiency. Through in-depth analysis, we show that MLLMs particularly underperform in two respects: (1) cross-view correspondence for partially occluded views and (2) establishing coarse camera poses. These findings highlight the necessity of domain-specific refinements or modules that embed stronger multi-view awareness. We believe that All-Angles Bench offers valuable insights and contributes to bridging the gap between MLLMs and human-level multi-view understanding. The project and benchmark are publicly available at https://danielchyeh.github.io/All-Angles-Bench/.
