Abstract
This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques, or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method takes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby maintaining the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.