Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models

  • 2025-08-25 15:55:22
  • Zesen Lyu, Dandan Zhang, Wei Ye, Fangdi Li, Zhihang Jiang, Yao Yang

Abstract

Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world. It relies on a nuanced understanding of spatial structures and inter-object relationships, serving as the foundation for complex reasoning and decision-making. To investigate whether current vision-language models (VLMs) exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate VLMs' spatial perception, structural understanding, and reasoning capabilities, while deliberately minimizing reliance on domain-specific knowledge to better isolate and assess the general spatial reasoning capability. We conduct a comprehensive evaluation across 24 state-of-the-art VLMs. The results show that even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy and performs particularly poorly on the Order Generation task, with only 30.00% accuracy, far below the performance exceeding 90% achieved by human participants. This persistent gap underscores the need for continued progress, positioning Jigsaw-Puzzles as a challenging and diagnostic benchmark for advancing spatial reasoning research in VLMs. Our project page is at https://zesen01.github.io/jigsaw-puzzles.

 
