PulseCheck457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models

Abstract

Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present PulseCheck457, a scalable and unbiased synthetic dataset designed around four key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluated various LMMs on PulseCheck457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.
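
The abstract names RPDR without giving its formula. As a rough illustration only, the sketch below computes a generic relative drop in accuracy between an easier and a harder task level. The function name `relative_drop`, the formula, and the example accuracies are all assumptions made for illustration; the paper's actual RPDR definition may differ.

```python
def relative_drop(base_acc: float, harder_acc: float) -> float:
    """Fraction of the baseline accuracy lost on the harder task.

    Assumed form: (base - harder) / base. This is a generic relative
    drop, not necessarily the paper's RPDR definition.
    """
    if base_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (base_acc - harder_acc) / base_acc


# Hypothetical accuracies, invented for illustration (not paper results).
drop = relative_drop(0.80, 0.60)
print(f"relative drop: {drop:.2%}")
```

A relative measure of this kind lets models with different baseline accuracies be compared on how much they lose when a new capability is required, rather than on absolute scores alone.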
