Abstract
Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.