Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer

  • 2024-05-23 18:49:37
  • Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, Yao Yao

Abstract

Generating high-quality 3D assets from text and images has long been challenging, primarily due to the absence of scalable 3D representations capable of capturing intricate geometry distributions. In this work, we introduce Direct3D, a native 3D generative model scalable to in-the-wild input images, without requiring a multiview diffusion model or SDS optimization. Our approach comprises two primary components: a Direct 3D Variational Auto-Encoder (D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT). D3D-VAE efficiently encodes high-resolution 3D shapes into a compact and continuous latent triplane space. Notably, our method directly supervises the decoded geometry using a semi-continuous surface sampling strategy, diverging from previous methods relying on rendered images as supervision signals. D3D-DiT models the distribution of encoded 3D latents and is specifically designed to fuse positional information from the three feature maps of the triplane latent, enabling a native 3D generative model scalable to large-scale 3D datasets. Additionally, we introduce an innovative image-to-3D generation pipeline incorporating semantic and pixel-level image conditions, allowing the model to produce 3D shapes consistent with the provided conditional image input. Extensive experiments demonstrate the superiority of our large-scale pre-trained Direct3D over previous image-to-3D approaches, achieving significantly better generation quality and generalization ability, thus establishing a new state-of-the-art for 3D content creation. Project page: https://nju-3dv.github.io/projects/Direct3D/.
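To make the "latent triplane" idea concrete, the following is a minimal sketch of how a 3D point is commonly looked up in a triplane representation: the point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the three feature vectors are combined. This is not taken from the Direct3D code. The function names `bilinear_sample` and `query_triplane`, the [-1, 1] coordinate range, and the choice to sum the three features are all assumptions made for illustration; a decoder network would typically map the resulting feature to a geometry value.

```python
from typing import List, Sequence

# A feature plane stored as nested lists with layout [height][width][channels].
Plane = List[List[List[float]]]


def bilinear_sample(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly sample a plane at (u, v), both assumed to lie in [-1, 1].

    u indexes the width axis and v the height axis (illustrative convention).
    """
    h, w = len(plane), len(plane[0])
    # Clamp, then map [-1, 1] onto pixel coordinates [0, size - 1].
    x = (min(max(u, -1.0), 1.0) + 1.0) * 0.5 * (w - 1)
    y = (min(max(v, -1.0), 1.0) + 1.0) * 0.5 * (h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    channels = len(plane[0][0])
    return [
        (1 - fy) * ((1 - fx) * plane[y0][x0][c] + fx * plane[y0][x1][c])
        + fy * ((1 - fx) * plane[y1][x0][c] + fx * plane[y1][x1][c])
        for c in range(channels)
    ]


def query_triplane(
    plane_xy: Plane, plane_xz: Plane, plane_yz: Plane, point: Sequence[float]
) -> List[float]:
    """Return the feature of a 3D point by projecting it onto three planes.

    Summation is one common aggregation choice (an assumption here);
    concatenation is another.
    """
    x, y, z = point
    f_xy = bilinear_sample(plane_xy, x, y)
    f_xz = bilinear_sample(plane_xz, x, z)
    f_yz = bilinear_sample(plane_yz, y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]
```

Because each plane is a 2D feature map, storage grows quadratically with resolution rather than cubically as in a dense voxel grid, which is the usual motivation for this kind of compact representation.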

 
