Abstract
Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive - though subhuman - visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.