Zero-1-to-G: Taming Pretrained 2D Diffusion Model for Direct 3D Generation

Abstract

Recent advances in 2D image generation have achieved remarkable quality, largely driven by the capacity of diffusion models and the availability of large-scale datasets. However, direct 3D generation is still constrained by the scarcity and lower fidelity of 3D datasets. In this paper, we introduce Zero-1-to-G, a novel approach that addresses this problem by enabling direct single-view generation on Gaussian splats using pretrained 2D diffusion models. Our key insight is that Gaussian splats, a 3D representation, can be decomposed into multi-view images encoding different attributes. This reframes the challenging task of direct 3D generation within a 2D diffusion framework, allowing us to leverage the rich priors of pretrained 2D diffusion models. To incorporate 3D awareness, we introduce cross-view and cross-attribute attention layers, which capture complex correlations and enforce 3D consistency across generated splats. This makes Zero-1-to-G the first direct image-to-3D generative model to effectively utilize pretrained 2D diffusion priors, enabling efficient training and improved generalization to unseen objects. Extensive experiments on both synthetic and in-the-wild datasets demonstrate superior performance in 3D object generation, offering a new approach to high-quality 3D generation.
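
To make the decomposition idea concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes pixel-aligned splats (one Gaussian per pixel per view) and a hypothetical attribute layout of position, scale, rotation, opacity, and color with 14 channels in total; the abstract does not specify the attribute set, channel counts, or tensor layout, so all names and sizes here are illustrative. The sketch shows how a splat tensor could be split into per-attribute multi-view images, merged back into a flat splat list, and how feature tokens could be regrouped so that attention runs either across views or across attributes.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
V, H, W = 4, 8, 8  # number of views, image height, image width

# Assumed attribute layout (not taken from the paper): channels per attribute.
ATTRIBUTE_CHANNELS = {
    "position": 3,
    "scale": 3,
    "rotation": 4,  # quaternion
    "opacity": 1,
    "color": 3,
}


def splats_to_attribute_images(splats):
    """Split a (V, H, W, 14) pixel-aligned splat tensor into per-attribute images."""
    images, start = {}, 0
    for name, channels in ATTRIBUTE_CHANNELS.items():
        images[name] = splats[..., start:start + channels]
        start += channels
    return images


def attribute_images_to_splats(images):
    """Inverse step: concatenate attribute images and flatten to an (N, 14) splat list."""
    stacked = np.concatenate([images[name] for name in ATTRIBUTE_CHANNELS], axis=-1)
    return stacked.reshape(-1, stacked.shape[-1])


def attention_groupings(tokens):
    """Regroup (V, A, T, C) feature tokens into sequences for two attention types.

    V = views, A = attributes, T = tokens per image, C = feature channels.
    """
    num_views, num_attrs, num_tokens, channels = tokens.shape
    # Cross-view: for each attribute, one sequence holding the tokens of all views.
    cross_view = tokens.transpose(1, 0, 2, 3).reshape(
        num_attrs, num_views * num_tokens, channels
    )
    # Cross-attribute: for each view, one sequence holding the tokens of all attributes.
    cross_attribute = tokens.reshape(num_views, num_attrs * num_tokens, channels)
    return cross_view, cross_attribute
```

In this sketch each attribute image has the same spatial size as an ordinary rendered view, which is what would let a 2D diffusion backbone treat it like an image; the two groupings differ only in which axis is folded into the sequence dimension before attention is applied.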
