Abstract
Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), enabling them to comprehend the physical world. Drawing inspiration from human perception mechanisms, prior studies attempt to build spatial understanding via grid-based cognitive maps. However, current grid-based methods rely on discretized representations, which limit the model's capacity for fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework represents objects by continuous boundary coordinates, which enables quantitative spatial computation and reduces the ambiguity inherent in natural language descriptions of spatial relationships. Our method comprises two stages. First, in the supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, from which the model learns the mapping from visual inputs to precise boundary coordinates. Second, a reinforcement fine-tuning stage enhances the model's generalization to real-world scenes. Building on this framework, we investigate the factors that affect cognitive map accuracy and quantify the relationship between map accuracy and task performance. Evaluated on mainstream spatial reasoning benchmarks, our model, V2LO-7B, achieves an average improvement of 3.24\% over the model trained on grid maps, supporting the advantage of continuous boundary coordinates over grid-based representations.
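To illustrate why continuous boundary coordinates permit quantitative computation where a grid does not, the following is a minimal sketch rather than the Video2Layout implementation. It assumes a hypothetical layout format in which each object is an axis-aligned top-down box in metres; the `Box` type, the helper names, the 5 m room size, and the 10 x 10 grid are all illustrative assumptions.

```python
import math
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned top-down footprint in metres (hypothetical layout format)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def center(b: Box) -> Tuple[float, float]:
    """Centre point of the footprint."""
    return ((b.x_min + b.x_max) / 2, (b.y_min + b.y_max) / 2)


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the two box centres."""
    (ax, ay), (bx, by) = center(a), center(b)
    return math.hypot(ax - bx, ay - by)


def gap(a: Box, b: Box) -> float:
    """Shortest distance between the two footprints (0 if they overlap)."""
    dx = max(a.x_min - b.x_max, b.x_min - a.x_max, 0.0)
    dy = max(a.y_min - b.y_max, b.y_min - a.y_max, 0.0)
    return math.hypot(dx, dy)


def grid_cell(b: Box, room_size: float = 5.0, cells: int = 10) -> Tuple[int, int]:
    """Cell index of the box centre on an assumed cells x cells grid map."""
    cx, cy = center(b)
    size = room_size / cells
    return (min(int(cx // size), cells - 1), min(int(cy // size), cells - 1))


# Illustrative objects; the coordinates are made up for this example.
chair = Box(1.0, 1.0, 1.5, 1.5)
cup = Box(1.05, 1.05, 1.15, 1.15)
table = Box(2.0, 1.0, 3.0, 2.0)

same_cell = grid_cell(chair) == grid_cell(cup)  # the grid cannot separate them
chair_table_gap = gap(chair, table)             # metric answer from boundaries
```

In this sketch the chair and the cup collapse into the same grid cell, so a grid map cannot say how far apart they are, whereas the boundary coordinates give the chair-to-table gap directly in metres.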