The goal of many computer vision systems is to transform image pixels into 3D representations. Recent popular models use neural networks to regress directly from pixels to 3D object parameters. Such an approach works well when supervision is available, but in problems like human pose and shape estimation, it is difficult to obtain natural images with 3D ground truth. To go one step further, we propose a new architecture that facilitates unsupervised, or lightly supervised, learning. The idea is to break the problem into a series of transformations between increasingly abstract representations. Each step involves a cycle designed to be learnable without annotated training data, and the chain of cycles delivers the final solution. Specifically, we use 2D body part segments as an intermediate representation that contains enough information to be lifted to 3D, and at the same time is simple enough to be learned in an unsupervised way. We demonstrate the method by learning 3D human pose and shape from unpaired and unannotated images. We also explore varying amounts of paired data and show that cycling greatly alleviates the need for paired data. While we present results for modeling humans, our formulation is general and can be applied to other vision problems.