Abstract
Text-guided 3D shape generation remains challenging due to the absence of large paired text-shape datasets, the substantial semantic gap between the two modalities, and the structural complexity of 3D shapes. This paper presents a new framework called Image as Stepping Stone (ISS) that introduces 2D images as a stepping stone to connect the two modalities and to eliminate the need for paired text-shape data. Our key contribution is a two-stage feature-space-alignment approach that maps CLIP features to shapes by harnessing a pre-trained single-view reconstruction (SVR) model with multi-view supervision: we first map the CLIP image feature to the detail-rich shape space of the SVR model, then map the CLIP text feature to the shape space and optimize the mapping by encouraging CLIP consistency between the input text and the rendered images. Further, we formulate a text-guided shape stylization module to dress up the output shapes with novel textures. Unlike existing work on 3D shape generation from text, our approach can create shapes across a broad range of categories without requiring paired text-shape data. Experimental results show that our approach outperforms state-of-the-art methods and our baselines in terms of fidelity and consistency with the text. Further, our approach can stylize the generated shapes with both realistic and fantasy structures and textures.
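As a rough illustration of the two alignment stages described above, the sketch below writes each stage as a loss over plain feature vectors. It is a minimal sketch under stated assumptions, not the paper's actual objectives: the function names, the mean-squared-error form of the stage-1 loss, and the "one minus mean cosine similarity" form of the stage-2 CLIP consistency loss are all hypothetical choices made for illustration, and the feature vectors stand in for the outputs of real CLIP and SVR encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def stage1_alignment_loss(mapped_feat, svr_shape_feat):
    """Stage 1 (assumed form): mean squared error between the mapped CLIP
    image feature and the SVR model's shape-space feature for the same image."""
    squared_errors = [(m - s) ** 2 for m, s in zip(mapped_feat, svr_shape_feat)]
    return sum(squared_errors) / len(squared_errors)


def stage2_clip_consistency_loss(text_feat, rendered_image_feats):
    """Stage 2 (assumed form): one minus the mean cosine similarity between
    the CLIP text feature and the CLIP features of the rendered views."""
    sims = [cosine_similarity(text_feat, f) for f in rendered_image_feats]
    return 1.0 - sum(sims) / len(sims)
```

In this toy setting, a stage-1 loss of zero means the mapped CLIP image feature coincides with the SVR shape feature, and a stage-2 loss of zero means every rendered view's feature points in the same direction as the text feature.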