Abstract
Recent advances in multimodal large language models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. In this work, we introduce SKG2Data, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2Data automatically constructs a Spatial Knowledge Graph (SKG) to emulate human-like perception of spatial directions and distances, which is subsequently utilized to guide multimodal data synthesis. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, not only enhance the spatial perception and reasoning abilities of MLLMs but also exhibit strong generalization capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.