Abstract
Generating high-quality 3D assets from textual descriptions remains a pivotal challenge in computer graphics and vision research. Due to the scarcity of 3D data, state-of-the-art approaches utilize pre-trained 2D diffusion priors, optimized through Score Distillation Sampling (SDS). Despite progress, crafting complex 3D scenes featuring multiple objects or intricate interactions is still difficult. To tackle this, recent methods have incorporated box or layout guidance. However, these layout-guided compositional methods often struggle to provide fine-grained control, as they are generally coarse and lack expressiveness. To overcome these challenges, we introduce a novel SDS approach, Semantic Score Distillation Sampling (SemanticSDS), designed to effectively improve the expressiveness and accuracy of compositional text-to-3D generation. Our approach integrates new semantic embeddings that maintain consistency across different rendering views and clearly differentiate between various objects and parts. These embeddings are transformed into a semantic map, which directs a region-specific SDS process, enabling precise optimization and compositional generation. By leveraging explicit semantic guidance, our method unlocks the compositional capabilities of existing pre-trained diffusion models, thereby achieving superior quality in 3D content generation, particularly for complex objects and scenes. Experimental results demonstrate that our SemanticSDS framework is highly effective for generating state-of-the-art complex 3D content. Code: https://github.com/YangLing0818/SemanticSDS-3D