Abstract
Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and their reliance on synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement processes. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. Our model leverages a multimodal transformer architecture to effectively encode the human body positional features and image features with an attention mechanism, enabling detailed preservation of clothing geometry and texture. To further boost face identity preservation and fine detail recovery, we propose a head feature pyramid encoding scheme to aggregate multi-scale features of the head regions. Extensive experiments demonstrate that our LHM generates plausible animatable humans in seconds without post-processing for face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.