Abstract
Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world. It relies on a nuanced understanding of spatial structures and inter-object relationships, serving as the foundation for complex reasoning and decision-making. To investigate whether current vision-language models (VLMs) exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate VLMs' spatial perception, structural understanding, and reasoning capabilities, while deliberately minimizing reliance on domain-specific knowledge to better isolate and assess the general spatial reasoning capability. We conduct a comprehensive evaluation across 24 state-of-the-art VLMs. The results show that even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy and performs particularly poorly on the Order Generation task, with only 30.00% accuracy, far below the performance exceeding 90% achieved by human participants. This persistent gap underscores the need for continued progress, positioning Jigsaw-Puzzles as a challenging and diagnostic benchmark for advancing spatial reasoning research in VLMs. Our project page is at https://zesen01.github.io/jigsaw-puzzles.