Abstract
Reconstructing an animatable 3D human from casually captured images of an articulated subject without camera or human pose information is a practical yet challenging task due to view misalignment, occlusions, and the absence of structural priors. While optimization-based methods can produce high-fidelity results from monocular or multi-view videos, they require accurate pose estimation and slow iterative optimization, limiting scalability in unconstrained scenarios. Recent feed-forward approaches enable efficient single-image reconstruction but struggle to effectively leverage multiple input images to reduce ambiguity and improve reconstruction accuracy. To address these challenges, we propose PF-LHM, a large human reconstruction model that generates high-quality 3D avatars in seconds from one or multiple casually captured pose-free images. Our approach introduces an efficient Encoder-Decoder Point-Image Transformer architecture, which fuses hierarchical geometric point features and multi-view image features through multimodal attention. The fused features are decoded to recover detailed geometry and appearance, represented using 3D Gaussian splats. Extensive experiments on both real and synthetic datasets demonstrate that our method unifies single- and multi-image 3D human reconstruction, achieving high-fidelity and animatable 3D human avatars without requiring camera and human pose annotations. Code and models will be released to the public.
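
To make the idea of fusing point and image features through attention concrete, the following is a minimal sketch of generic joint self-attention over two token sets. It is not the PF-LHM architecture, whose details the abstract does not give. The function name `joint_attention`, the toy token values, the single attention head, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(point_tokens, image_tokens):
    """Single-head self-attention over the concatenation of both token sets.

    Hypothetical illustration only: learned projections are replaced by the
    identity, so queries, keys, and values are the raw tokens themselves.
    """
    tokens = point_tokens + image_tokens  # one shared sequence
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)]
        )
    n_points = len(point_tokens)
    return fused[:n_points], fused[n_points:]


# Toy inputs (assumed): 2 geometric point tokens and 3 image tokens, dim 4.
point_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_tokens = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
fused_points, fused_images = joint_attention(point_tokens, image_tokens)
```

Because every token attends to the full concatenated sequence, each point token receives information from the image tokens and vice versa. The image-token list can have any length, so the same routine accepts tokens from one image or from several.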