Abstract
We propose a 3D generation pipeline that uses diffusion models to generate realistic human digital avatars. Due to the wide variety of human identities, poses, and stochastic details, the generation of 3D human meshes has been a challenging problem. To address this, we decompose the problem into 2D normal map generation and normal map-based 3D reconstruction. Specifically, we first simultaneously generate realistic normal maps for the front and back sides of a clothed human, dubbed dual normal maps, using a pose-conditional diffusion model. For 3D reconstruction, we "carve" the prior SMPL-X mesh into a detailed 3D mesh according to the normal maps through mesh optimization. To further enhance the high-frequency details, we present a diffusion resampling scheme on both body and facial regions, thus encouraging the generation of realistic digital avatars. We also seamlessly incorporate a recent text-to-image diffusion model to support text-based human identity control. Our method, Chupa, is capable of generating realistic 3D clothed humans with better perceptual quality and identity variety.