CUPID: Pose-Grounded Generative 3D Reconstruction from a Single Image

Abstract

This work proposes a new generation-based 3D reconstruction method, named Cupid, that accurately infers the camera pose, 3D shape, and texture of an object from a single 2D image. Cupid casts 3D reconstruction as a conditional sampling process from a learned distribution of 3D objects, and it jointly generates voxels and pixel-voxel correspondences, enabling robust pose and shape estimation under a unified generative framework. By representing both input camera poses and 3D shape as a distribution in a shared 3D latent space, Cupid adopts a two-stage flow matching pipeline: (1) a coarse stage that produces initial 3D geometry with associated 2D projections for pose recovery; and (2) a refinement stage that integrates pose-aligned image features to enhance structural fidelity and appearance details. Extensive experiments demonstrate that Cupid outperforms leading 3D reconstruction methods with a PSNR gain of over 3 dB and a Chamfer Distance reduction of over 10%, while matching monocular estimators on pose accuracy and delivering superior visual fidelity over baseline 3D generative models. For an immersive view of the 3D results generated by Cupid, please visit cupid3d.github.io.
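To make the role of pixel-voxel correspondences concrete, the sketch below shows one generic way a camera pose can be recovered once 3D points and their 2D projections are paired. The abstract does not say which solver Cupid uses, so this is only an illustration using the classical Direct Linear Transform (DLT). It assumes known intrinsics `K`, at least six non-coplanar points, and noise-free correspondences; the function name `estimate_pose_dlt` and all synthetic values are hypothetical.

```python
import numpy as np

def estimate_pose_dlt(points_3d, points_2d, K):
    """Recover (R, t) from 3D-2D correspondences with a plain DLT.

    Illustrative only: not the solver described in the paper.
    points_3d: (N, 3) object-space points (e.g. voxel centers), N >= 6.
    points_2d: (N, 2) pixel coordinates of the same points.
    K:         (3, 3) camera intrinsics, assumed known.
    """
    n = points_3d.shape[0]
    X = np.hstack([points_3d, np.ones((n, 1))])           # homogeneous 3D points
    # Map pixels to normalized image coordinates so the unknown is [R | t].
    pix_h = np.hstack([points_2d, np.ones((n, 1))])
    norm = pix_h @ np.linalg.inv(K).T
    u, v = norm[:, 0], norm[:, 1]

    # Two linear equations per correspondence in the 12 entries of [R | t].
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = X
    A[0::2, 8:12] = -u[:, None] * X
    A[1::2, 4:8] = X
    A[1::2, 8:12] = -v[:, None] * X

    # The null vector of A is the solution, up to an unknown scale and sign.
    _, _, vt = np.linalg.svd(A)
    Rt = vt[-1].reshape(3, 4)
    if np.linalg.det(Rt[:, :3]) < 0:                       # fix the sign
        Rt = -Rt

    # Project the 3x3 part onto the nearest rotation and undo the scale.
    U, S, Vt = np.linalg.svd(Rt[:, :3])
    R = U @ Vt
    t = Rt[:, 3] / S.mean()
    return R, t

# Synthetic check with made-up values: project random points, then recover the pose.
rng = np.random.default_rng(0)
points_3d = rng.uniform(-0.5, 0.5, size=(50, 3))
a, b = 0.3, 0.4
Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
Rx = np.array([[1, 0, 0], [0, np.cos(b), -np.sin(b)], [0, np.sin(b), np.cos(b)]])
R_true = Rz @ Rx
t_true = np.array([0.1, -0.2, 2.5])
K = np.array([[500.0, 0.0, 256.0], [0.0, 500.0, 256.0], [0.0, 0.0, 1.0]])

cam = points_3d @ R_true.T + t_true                        # points in camera frame
proj = cam @ K.T
points_2d = proj[:, :2] / proj[:, 2:3]                     # perspective divide

R_est, t_est = estimate_pose_dlt(points_3d, points_2d, K)
```

With noisy or partially wrong correspondences, a robust estimator (for example RANSAC around a PnP solver) would typically replace this plain least-squares step.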
