Abstract
Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, and invariance to perspective-preserving transformations. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and with maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources are available at: https://yunlong10.github.io/MMPerspective/