Abstract
3D human reconstruction and animation are long-standing topics in computer graphics and vision. However, existing methods typically rely on sophisticated dense-view capture and/or time-consuming per-subject optimization procedures. To address these limitations, we propose HumanRAM, a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions, parameterized by a shared SMPL-X neural texture, into transformer-based large reconstruction models (LRM). Given monocular or sparse input images with associated camera parameters and SMPL-X poses, our model employs scalable transformers and a DPT-based decoder to synthesize realistic human renderings under novel viewpoints and novel poses. By leveraging the explicit pose conditions, our model simultaneously enables high-quality human reconstruction and high-fidelity pose-controlled animation. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets. Video results are available at https://zju3dv.github.io/humanram/.