We aim to infer the 3D shape and pose of an object from a single image and propose a learning-based approach that can be trained from unstructured image collections, supervised only by segmentation outputs from off-the-shelf recognition systems (i.e., 'shelf-supervised'). We first infer a volumetric representation in a canonical frame, along with the camera pose. We enforce that the representation is geometrically consistent with both appearance and masks, and that the synthesized novel views are indistinguishable from the image collections. The coarse volumetric prediction is then converted to a mesh-based representation, which is further refined in the predicted camera frame. These two steps allow both shape-pose factorization from image collections and per-instance reconstruction in finer detail. We evaluate the method on both synthetic and real-world datasets and demonstrate its scalability on 50 categories in the wild, an order of magnitude more classes than existing works.
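
To make the mask-consistency idea concrete, here is a minimal sketch and not the implementation used in this work. It assumes an orthographic camera looking along the z axis, a voxel grid of occupancy probabilities, and a squared-error loss; the names `project_silhouette` and `mask_loss` are hypothetical and introduced only for illustration.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed [z][y][x], occupancy probability in [0, 1]
Mask = List[List[float]]          # indexed [y][x], 1.0 = foreground


def project_silhouette(volume: Volume) -> Mask:
    """Project occupancies along z (assumed orthographic camera).

    A pixel is covered unless every voxel on its ray is empty, so its
    coverage is 1 minus the product of (1 - occupancy) along the ray.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    silhouette = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            p_empty = 1.0
            for z in range(depth):
                p_empty *= 1.0 - volume[z][y][x]
            silhouette[y][x] = 1.0 - p_empty
    return silhouette


def mask_loss(silhouette: Mask, mask: Mask) -> float:
    """Mean squared error between the projected silhouette and a segmentation mask."""
    total = 0.0
    count = 0
    for row_s, row_m in zip(silhouette, mask):
        for s, m in zip(row_s, row_m):
            total += (s - m) ** 2
            count += 1
    return total / count


# Toy 2x2x2 volume: one fully occupied voxel and one half-occupied voxel.
volume = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.5]],
]
mask = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for an off-the-shelf segmentation
loss = mask_loss(project_silhouette(volume), mask)
```

In a learned system the projection would be differentiable and would use the predicted camera pose, so that the loss gradient can update both shape and pose. The fixed viewing direction here is only a simplification for the example.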