Abstract
We present a method for learning 3D spatial relationships between object pairs, referred to as object-object spatial relationships (OOR), by leveraging synthetically generated 3D samples from pre-trained 2D diffusion models. We hypothesize that images synthesized by 2D diffusion models inherently capture plausible and realistic OOR cues, enabling efficient ways to collect a 3D dataset to learn OOR for various unbounded object categories. Our approach begins by synthesizing diverse images that capture plausible OOR cues, which we then uplift into 3D samples. Leveraging our diverse collection of plausible 3D samples for the object pairs, we train a score-based OOR diffusion model to learn the distribution of their relative spatial relationships. Additionally, we extend our pairwise OOR to multi-object OOR by enforcing consistency across pairwise relations and preventing object collisions. Extensive experiments demonstrate the robustness of our method across various object-object spatial relationships, along with its applicability to real-world 3D scene arrangement tasks using the OOR diffusion model.