Abstract
While 3D generative models have greatly improved artists' workflows, existing diffusion models for 3D generation suffer from slow generation and poor generalization. To address these issues, we propose a two-stage approach named Hunyuan3D-1.0, available in a lite version and a standard version, both of which support text- and image-conditioned generation. In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB images in approximately 4 seconds. These multi-view images capture rich details of the 3D asset from different viewpoints, relaxing the task from single-view to multi-view reconstruction. In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset from the generated multi-view images in approximately 7 seconds. The reconstruction network learns to handle the noise and inconsistency introduced by the multi-view diffusion and leverages the available information from the condition image to efficiently recover the 3D structure. Our framework incorporates a text-to-image model, Hunyuan-DiT, making it a unified framework that supports both text- and image-conditioned 3D generation. Our standard version has 3x more parameters than our lite version and other existing models. Hunyuan3D-1.0 achieves an impressive balance between speed and quality, significantly reducing generation time while maintaining the quality and diversity of the produced assets.