STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

Abstract

Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks, where caption-only answering reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
