Abstract
The creation of complex 3D scenes tailored to user specifications has been a tedious and challenging task with traditional 3D modeling tools. Although some pioneering methods have achieved automatic text-to-3D generation, they are generally limited to small-scale scenes with restricted control over shape and texture. We introduce SceneCraft, a novel method for generating detailed indoor scenes that adhere to textual descriptions and spatial layout preferences provided by users. Central to our method is a rendering-based technique, which converts 3D semantic layouts into multi-view 2D proxy maps. Furthermore, we design a semantic- and depth-conditioned diffusion model to generate multi-view images, which are used to learn a neural radiance field (NeRF) as the final scene representation. Without the constraints of panorama image generation, we surpass previous methods in supporting complicated indoor space generation beyond a single room, even as complicated as a whole multi-bedroom apartment with irregular shapes and layouts. Through experimental analysis, we demonstrate that our method significantly outperforms existing approaches in complex indoor scene generation with diverse textures, consistent geometry, and realistic visual quality. Code and more results are available at: https://orangesodahub.github.io/SceneCraft