Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs, which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that our Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.
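
The idea of selecting spatially informative frames under a limited token budget can be illustrated with a minimal sketch. The abstract does not state the selection criterion, so the greedy maximum-coverage rule below is an assumption made for illustration only, not the authors' method. The function name `select_frames`, the `coverage` mapping, and the notion of "scene cells" visible in each frame are all hypothetical.

```python
from typing import Dict, List, Set


def select_frames(coverage: Dict[int, Set[int]], budget: int) -> List[int]:
    """Greedily pick up to `budget` frames that maximize newly covered cells.

    `coverage` maps a frame index to the ids of the (hypothetical) scene
    cells visible in that frame. This is an assumed stand-in criterion.
    """
    covered: Set[int] = set()
    chosen: List[int] = []
    remaining = dict(coverage)
    while remaining and len(chosen) < budget:
        # Pick the frame adding the most unseen cells; ties go to the earliest frame.
        best = max(remaining, key=lambda f: (len(remaining[f] - covered), -f))
        if not remaining[best] - covered:
            break  # no remaining frame adds new spatial information
        covered |= remaining.pop(best)
        chosen.append(best)
    return sorted(chosen)


# Toy example: frame index -> ids of hypothetical scene cells visible in it.
frames = {0: {1, 2, 3}, 1: {2, 3}, 2: {4, 5}, 3: {1, 5, 6, 7}}
print(select_frames(frames, budget=2))  # prints [0, 3]
```

In this toy input, frame 3 is chosen first because it covers the most cells, and frame 0 is chosen next because it adds two cells not yet seen. Frame 1 is never chosen because it is redundant with frame 0.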
