SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models

Abstract

Reasoning about motion and space is a fundamental cognitive capability that is required by multiple real-world applications. While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they focus only on static spatial relationships, and not on dynamic awareness of motion and space, i.e., reasoning about the effect of egocentric and object motions on spatial relationships. Manually annotating such object and camera movements is expensive. Hence, we introduce SAT, a simulated spatial aptitude training dataset comprising both static and dynamic spatial reasoning across 175K question-answer (QA) pairs and 20K scenes. Complementing this, we also construct a small (150 image-QAs) yet challenging dynamic spatial test set using real-world images. Leveraging our SAT datasets and 6 existing static spatial benchmarks, we systematically investigate what improves both static and dynamic spatial awareness. Our results reveal that simulations are surprisingly effective at imparting spatial aptitude to MLMs that translates to real images. We show that perfect annotations in simulation are more effective than existing approaches of pseudo-annotating real images. For instance, SAT training improves a LLaVA-13B model by an average of 11% and a LLaVA-Video-7B model by an average of 8% on multiple spatial benchmarks, including our real-image dynamic test set and spatial reasoning on long videos, even outperforming some large proprietary models. While reasoning over static relationships improves with synthetic training data, there is still considerable room for improvement for dynamic reasoning questions.
