Abstract
Spatial intelligence (SI) represents a cognitive ability encompassing the visualization, manipulation, and reasoning about spatial relationships, underpinning disciplines from neuroscience to robotics. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized format of multi-choice visual question-answering, designed to assess large vision-language models' spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). Our approach to curating the benchmark combines a bottom-up survey of 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science, which prompt us to design two novel types of tasks about view-taking and dynamic scenes. Extensive experiments reveal that leading models lag behind human experts, especially in spatial orientation, a fundamental SI factor. Moreover, we demonstrate a positive correlation between a model's spatial reasoning proficiency and its performance on an embodied AI task.