Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer

Abstract

Generating high-quality 3D assets from text and images has long been challenging, primarily due to the absence of scalable 3D representations capable of capturing intricate geometry distributions. In this work, we introduce Direct3D, a native 3D generative model scalable to in-the-wild input images, without requiring a multiview diffusion model or SDS optimization. Our approach comprises two primary components: a Direct 3D Variational Auto-Encoder (D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT). D3D-VAE efficiently encodes high-resolution 3D shapes into a compact and continuous latent triplane space. Notably, our method directly supervises the decoded geometry using a semi-continuous surface sampling strategy, diverging from previous methods relying on rendered images as supervision signals. D3D-DiT models the distribution of encoded 3D latents and is specifically designed to fuse positional information from the three feature maps of the triplane latent, enabling a native 3D generative model scalable to large-scale 3D datasets. Additionally, we introduce an innovative image-to-3D generation pipeline incorporating semantic and pixel-level image conditions, allowing the model to produce 3D shapes consistent with the provided conditional image input. Extensive experiments demonstrate the superiority of our large-scale pre-trained Direct3D over previous image-to-3D approaches, achieving significantly better generation quality and generalization ability, thus establishing a new state-of-the-art for 3D content creation. Project page: https://nju-3dv.github.io/projects/Direct3D/.
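
For readers unfamiliar with triplane representations, the following is a minimal sketch of how a 3D point can be looked up in three axis-aligned feature planes. It illustrates the generic triplane idea only and is not the D3D-VAE decoder: the function names, the plane ordering (XY, XZ, YZ), the [-1, 1] coordinate range, and the choice to sum the three sampled features are all assumptions made for illustration.

```python
from typing import List, Sequence

# A plane is indexed as plane[row][col] and holds one feature vector per cell.
Plane = List[List[List[float]]]


def bilinear_sample(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly interpolate a feature vector at (u, v), with u, v in [-1, 1]."""
    height, width = len(plane), len(plane[0])
    # Map [-1, 1] to continuous cell coordinates in [0, size - 1], then clamp.
    x = min(max((u + 1.0) * 0.5 * (width - 1), 0.0), width - 1.0)
    y = min(max((v + 1.0) * 0.5 * (height - 1), 0.0), height - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    tx, ty = x - x0, y - y0
    channels = len(plane[0][0])
    return [
        (1 - tx) * (1 - ty) * plane[y0][x0][c]
        + tx * (1 - ty) * plane[y0][x1][c]
        + (1 - tx) * ty * plane[y1][x0][c]
        + tx * ty * plane[y1][x1][c]
        for c in range(channels)
    ]


def query_triplane(
    xy: Plane, xz: Plane, yz: Plane, point: Sequence[float]
) -> List[float]:
    """Project a 3D point onto each plane, sample, and sum the features.

    Summation is an assumed aggregation; concatenation is another common choice.
    """
    x, y, z = point
    f_xy = bilinear_sample(xy, x, y)
    f_xz = bilinear_sample(xz, x, z)
    f_yz = bilinear_sample(yz, y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]
```

In a full model, the feature vector returned for each query point would typically be passed to a small network that predicts a geometric quantity such as occupancy; that step is omitted here because the abstract does not specify it.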
