Abstract
Text-to-3D generation, which aims to synthesize vivid 3D objects from text prompts, has attracted much attention from the computer vision community. While several existing works have achieved impressive results for this task, they mainly rely on a time-consuming optimization paradigm. Specifically, these methods optimize a neural field from scratch for each text prompt, taking approximately one hour or more to generate one object. This heavy and repetitive training cost impedes their practical deployment. In this paper, we propose a novel framework for fast text-to-3D generation, dubbed Instant3D. Once trained, Instant3D is able to create a 3D object for an unseen text prompt in less than one second with a single run of a feedforward network. We achieve this remarkable speed by devising a new network that directly constructs a 3D triplane from a text prompt. The core innovation of our Instant3D lies in our exploration of strategies to effectively inject text conditions into the network. Furthermore, we propose a simple yet effective activation function, the scaled-sigmoid, to replace the original sigmoid function, which speeds up the training convergence by more than ten times. Finally, to address the Janus (multi-head) problem in 3D generation, we propose an adaptive Perp-Neg algorithm that can dynamically adjust its concept negation scales according to the severity of the Janus problem during training, effectively reducing the multi-head effect. Extensive experiments on a wide variety of benchmark datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods both qualitatively and quantitatively, while achieving significantly better efficiency. The project page is at https://ming1993li.github.io/Instant3DProj.