Abstract
Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable success across various fields. However, there are few studies on 3D indoor scene generation with VLMs. This paper treats the task as a planning problem subject to spatial and layout common-sense constraints. To solve the problem with a VLM, we propose a new global-local tree search algorithm. Globally, the method places objects sequentially and explores multiple candidate placements for each object, so that the problem space is represented as a tree. To reduce the depth of the tree, we decompose the scene structure hierarchically into the room level, region level, floor object level, and supported object level. The algorithm independently generates the floor objects in different regions and the supported objects placed on different floor objects. Locally, we also decompose each sub-task, the placement of a single object, into multiple steps. The algorithm then searches the tree of the problem space. To leverage the VLM to produce object positions, we discretize the top-down view as a dense grid and fill the cells with diverse emojis to make the cells distinct. We prompt the VLM with the emoji grid, and the VLM produces a reasonable location for the object by describing the position with the names of emojis. Quantitative and qualitative experimental results show that our approach generates more plausible 3D scenes than state-of-the-art approaches. Our source code is available at https://github.com/dw-dengwei/TreeSearchGen .
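As a rough illustration of the emoji-grid idea described above, the following minimal Python sketch builds a grid of distinct emojis, renders it as text that could be placed in a prompt, and maps an emoji name returned by a model back to room coordinates. It is not the authors' implementation: the function names (`build_emoji_grid`, `render_grid`, `locate`), the choice of the animal emoji block starting at U+1F400, the use of Unicode character names as the "names of emojis", and the 6 x 6 grid over a 6 m x 6 m room are all assumptions made for illustration. The VLM call itself is omitted.

```python
import unicodedata


def build_emoji_grid(rows, cols, first_codepoint=0x1F400):
    """Return a rows x cols grid in which every cell holds a distinct emoji.

    Assumption: consecutive code points starting at U+1F400 (animal emojis)
    are used; this range has 63 named characters, enough for a 6 x 6 grid.
    """
    return [
        [chr(first_codepoint + r * cols + c) for c in range(cols)]
        for r in range(rows)
    ]


def render_grid(grid):
    """Render the grid as plain text, one row per line, for use in a prompt."""
    return "\n".join(" ".join(row) for row in grid)


def locate(grid, emoji_name, room_width, room_depth):
    """Map an emoji name to the centre of its cell in room coordinates."""
    rows, cols = len(grid), len(grid[0])
    wanted = emoji_name.strip().upper()
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if unicodedata.name(ch) == wanted:
                x = (c + 0.5) * room_width / cols
                y = (r + 0.5) * room_depth / rows
                return (x, y)
    raise ValueError(f"no cell is labelled {emoji_name!r}")


grid = build_emoji_grid(6, 6)
prompt_grid = render_grid(grid)  # text that would accompany the top-down view

# Suppose the model answers with the name "rat" (hypothetical reply).
position = locate(grid, "rat", room_width=6.0, room_depth=6.0)
```

Because every cell carries a different emoji, a name in the model's reply identifies exactly one cell, which avoids asking the model to output raw numeric coordinates.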