Abstract
We present DenseRaC, a novel end-to-end framework for jointly estimating 3D human pose and body shape from a monocular RGB image. Our two-step framework takes the body pixel-to-surface correspondence map (i.e., IUV map) as a proxy representation and then estimates parameterized human pose and shape. Specifically, given an estimated IUV map, we develop a deep neural network that optimizes 3D body reconstruction losses and further integrates a render-and-compare scheme to minimize differences between the input and the rendered output, i.e., dense body landmarks, body part masks, and adversarial priors. To boost learning, we further construct a large-scale synthetic dataset (MOCA) utilizing web-crawled Mocap sequences, 3D scans and animations. The generated data covers diversified camera views, human actions and body shapes, and is paired with full ground truth. Our model jointly learns to represent the 3D human body from hybrid datasets, mitigating the problem of unpaired training data. Our experiments show that DenseRaC obtains superior performance against the state of the art on public benchmarks of various human-related tasks.