Abstract
Reconstructing a high-fidelity 3D human model from a single in-the-wild human photo remains a challenging task. Existing methods face difficulties including a) the varying body proportions captured by in-the-wild human images; b) diverse personal belongings within the shot; and c) ambiguities in human postures and inconsistency in human textures. In addition, the scarcity of high-quality human data intensifies the challenge. To address these problems, we propose a Generalizable image-to-3D huMAN reconstruction framework, dubbed GeneMAN, building upon a comprehensive multi-source collection of high-quality human data, including 3D scans, multi-view videos, single photos, and our generated synthetic human data. GeneMAN encompasses three key modules. 1) Without relying on parametric human models (e.g., SMPL), GeneMAN first trains a human-specific text-to-image diffusion model and a view-conditioned diffusion model, which serve as the GeneMAN 2D human prior and 3D human prior for reconstruction, respectively. 2) With the help of the pretrained human prior models, the Geometry Initialization-&-Sculpting pipeline is leveraged to recover high-quality 3D human geometry from a single image. 3) To achieve high-fidelity 3D human textures, GeneMAN employs the Multi-Space Texture Refinement pipeline, which refines textures first in the latent space and then in the pixel space. Extensive experimental results demonstrate that GeneMAN can generate high-quality 3D human models from a single input image, outperforming prior state-of-the-art methods. Notably, GeneMAN generalizes much better to in-the-wild images, often yielding high-quality 3D human models in natural poses with common items, regardless of the body proportions in the input images.