This paper addresses the problem of 3D human pose estimation in the wild. A significant challenge is the lack of training data, i.e., 2D images of humans annotated with 3D poses. Such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images with 2D human pose annotations using 3D motion capture data. Given a candidate 3D pose, our algorithm selects for each joint an image whose 2D pose locally matches the projected 3D pose. The selected images are then combined to generate a new synthetic image by stitching local image patches in a kinematically constrained manner. The resulting images are used to train an end-to-end CNN for full-body 3D pose estimation. We cluster the training data into a large number of pose classes and tackle pose estimation as a $K$-way classification problem. Such an approach is viable only with large training sets such as ours. Our method outperforms most of the published works in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for real-world images (LSP). This demonstrates that CNNs trained on artificial images generalize well to real images. Compared to data generated from more classical rendering engines, our synthetic images do not require any domain adaptation or fine-tuning stage.
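To make the $K$-way classification formulation concrete, the following is a minimal sketch of how 3D poses could be grouped into pose classes and how a predicted class could be mapped back to a 3D pose. The abstract does not state which clustering algorithm is used, so plain k-means is an assumption here, and the function names `cluster_poses`, `assign_pose_class`, and `class_to_pose` are hypothetical.

```python
import numpy as np


def assign_pose_class(flat_poses, flat_centroids):
    """Return, for each flattened pose, the index of its nearest centroid."""
    # Squared Euclidean distances between all poses and all centroids: (N, k).
    diff = flat_poses[:, None, :] - flat_centroids[None, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)


def cluster_poses(poses, k, n_iters=20, seed=0):
    """Cluster 3D poses of shape (N, J, 3) into k pose classes.

    Assumption: plain k-means on flattened joint coordinates.
    Returns centroids of shape (k, J, 3) and integer labels of shape (N,).
    """
    rng = np.random.default_rng(seed)
    n = poses.shape[0]
    flat = poses.reshape(n, -1).astype(np.float64)
    # Initialise the centroids from k distinct training poses.
    centroids = flat[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iters):
        labels = assign_pose_class(flat, centroids)
        for c in range(k):
            members = flat[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    labels = assign_pose_class(flat, centroids)
    return centroids.reshape(k, *poses.shape[1:]), labels


def class_to_pose(class_ids, centroids):
    """Map predicted class indices back to representative 3D poses."""
    return centroids[np.asarray(class_ids)]
```

In this sketch, the labels returned by `cluster_poses` would serve as the classification targets for the CNN, and `class_to_pose` would convert a predicted class back into a 3D pose estimate by returning the corresponding cluster centroid. How finely the pose space is quantized depends on $K$, which is why a large training set is needed to populate every class.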


