Abstract
From a collection of 2D images and silhouettes, we learn a self-supervised, single-view 3D reconstruction model that predicts the 3D mesh shape, texture and camera pose of a target object. The proposed method does not require 3D supervision, manually annotated keypoints, multi-view images of an object or a prior 3D template. The key insight of our work is that objects can be represented as a collection of deformable parts, and each part is semantically coherent across different instances of the same category (e.g., wings on birds and wheels on cars). Therefore, by leveraging part segmentation learned in a self-supervised manner from a large collection of category-specific images, we can effectively enforce semantic consistency between the reconstructed meshes and the original images. This significantly reduces ambiguities during joint prediction of the shape and camera pose of an object, along with its texture. To the best of our knowledge, we are the first to attempt to solve the single-view reconstruction problem without a category-specific template mesh or semantic keypoints. Thus, our model can easily generalize to various object categories without such labels, e.g., horses and penguins. Through a variety of experiments on several categories of deformable and rigid objects, we demonstrate that our unsupervised method performs comparably to, if not better than, existing category-specific reconstruction methods learned with supervision.