Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning

Abstract

Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely on predefined constraints to optimize numerical layouts, which sacrifices flexibility. As a result, they fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions using the generalizable spatial reasoning of large language models (LLMs). DirectLayout decomposes the generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into 3D space, and refining object placements. To enable explicit spatial reasoning and help the model grasp basic principles of object placement, we employ Chain-of-Thought (CoT) Activation based on the 3D-Front dataset. Additionally, we design a CoT-Grounded Generative Layout Reward to enhance generalization and spatial planning. During inference, DirectLayout addresses asset-layout mismatches via Iterative Asset-Layout Alignment through in-context learning. Extensive experiments demonstrate that DirectLayout achieves impressive semantic consistency, generalization, and physical plausibility.
