Abstract
Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, limiting scene diversity and layout complexity. While large language models (LLMs) can leverage diverse text-domain knowledge, they struggle with spatial realism, often producing unnatural object placements that fail to respect common sense. Our key insight is that vision perception can bridge this gap by providing realistic spatial guidance that LLMs lack. To this end, we introduce Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first employs an LLM to draft a coarse layout. A vision module then refines it by generating image guidance and extracting scene structure to capture inter-object relations. Next, an optimization module iteratively enforces accurate pose alignment and physical plausibility, preventing artifacts like object penetration and instability. Finally, a judge module verifies spatial coherence. Comprehensive experiments show that Scenethesis generates diverse, realistic, and physically plausible 3D interactive scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.