SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs

Abstract

Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark's strong discriminative power, but also uncovers counter-intuitive findings: models show difficulty perception misaligned with human intuition, exhibit dramatic 2D-to-3D performance cliffs, default to formulaic derivation over visualization, and paradoxically suffer performance degradation from Chain-of-Thought prompting in open-source models. Through statistical and qualitative analysis of error types, SpatialViz-Bench demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark data and evaluation code are publicly available.
