Abstract
Generating unbounded 3D scenes is crucial for large-scale scene understanding and simulation. Urban scenes, unlike natural landscapes, consist of various complex man-made objects and structures such as roads, traffic signs, vehicles, and buildings. To create a realistic and detailed urban scene, it is crucial to accurately represent the geometry and semantics of the underlying objects, going beyond their visual appearance. In this work, we propose UrbanDiffusion, a 3D diffusion model that is conditioned on a Bird's-Eye View (BEV) map and generates an urban scene with geometry and semantics in the form of a semantic occupancy map. Our model introduces a novel paradigm that learns the data distribution of scene-level structures within a latent space and further enables the expansion of the synthesized scene to an arbitrary scale. After training on real-world driving datasets, our model can generate a wide range of diverse urban scenes given the BEV maps from the held-out set and also generalize to the synthesized maps from a driving simulator. We further demonstrate its application to scene image synthesis with a pretrained image generator as a prior.