Tencent Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation

  • 2024-11-05 14:33:41
  • Xianghui Yang, Huiwen Shi, Bowen Zhang, Fan Yang, Jiacheng Wang, Hongxu Zhao, Xinhai Liu, Xinzhou Wang, Qingxiang Lin, Jiaao Yu, Lifu Wang, Zhuo Chen, Sicong Liu, Yuhong Liu, Yong Yang, Di Wang, Jie Jiang, Chunchao Guo

Abstract

While 3D generative models have greatly improved artists' workflows, existing diffusion models for 3D generation suffer from slow generation and poor generalization. To address these issues, we propose a two-stage approach named Hunyuan3D-1.0, including a lite version and a standard version, both of which support text- and image-conditioned generation. In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB images in approximately 4 seconds. These multi-view images capture rich details of the 3D asset from different viewpoints, relaxing the task from single-view to multi-view reconstruction. In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset from the generated multi-view images in approximately 7 seconds. The reconstruction network learns to handle the noise and inconsistency introduced by the multi-view diffusion and leverages the available information from the condition image to efficiently recover the 3D structure. Our framework incorporates a text-to-image model, i.e., Hunyuan-DiT, making it a unified framework that supports both text- and image-conditioned 3D generation. Our standard version has 3x more parameters than our lite version and other existing models. Hunyuan3D-1.0 achieves an impressive balance between speed and quality, significantly reducing generation time while maintaining the quality and diversity of the produced assets.
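
The data flow described in the abstract can be illustrated with a minimal Python sketch. This is not the Hunyuan3D-1.0 code or API: the class `TwoStagePipeline`, its three callable fields, and the string stand-ins for images and meshes are all hypothetical placeholders, and the number of views (4) is an arbitrary choice for the demo. The sketch only shows how an optional text-to-image step feeds a multi-view stage, and how the reconstruction stage receives both the generated views and the condition image.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stand-ins: real systems would pass image tensors and mesh objects.
Image = str
Mesh = str


@dataclass
class TwoStagePipeline:
    """Illustrative wiring only; every name here is an assumption."""

    text_to_image: Callable[[str], Image]             # e.g. a text-to-image model
    multiview_diffusion: Callable[[Image], List[Image]]  # stage 1
    reconstruct: Callable[[List[Image], Image], Mesh]    # stage 2

    def generate(self, prompt: Optional[str] = None,
                 image: Optional[Image] = None) -> Mesh:
        # Exactly one kind of condition must be given: text or image.
        if (prompt is None) == (image is None):
            raise ValueError("provide exactly one of prompt or image")
        # Text conditions are first turned into a condition image.
        condition = image if image is not None else self.text_to_image(prompt)
        # Stage 1: generate several views of the object from the condition image.
        views = self.multiview_diffusion(condition)
        # Stage 2: reconstruct the asset from the views plus the condition image.
        return self.reconstruct(views, condition)


# Toy stubs so the sketch runs end to end; they do no real generation.
pipeline = TwoStagePipeline(
    text_to_image=lambda prompt: f"image({prompt})",
    multiview_diffusion=lambda img: [f"view{i}({img})" for i in range(4)],
    reconstruct=lambda views, cond: f"mesh({len(views)} views, cond={cond})",
)

print(pipeline.generate(prompt="a chair"))
print(pipeline.generate(image="photo.png"))
```

Passing the condition image to the reconstruction step alongside the generated views mirrors the abstract's statement that the reconstruction network uses information from the condition image; how the actual model consumes it is not specified here.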

 
