Abstract
We present a framework for learning single-view shape and pose prediction without using direct supervision for either. Our approach allows leveraging multi-view observations from unknown poses as a supervisory signal during training. Our proposed training setup enforces geometric consistency between the independently predicted shape and pose from two views of the same instance. We consequently learn to predict shape in an emergent canonical (view-agnostic) frame along with a corresponding pose predictor. We show empirical and qualitative results using the ShapeNet dataset and observe performance that is encouragingly competitive with previous techniques which rely on stronger forms of supervision. We also demonstrate the applicability of our framework in a realistic setting which is beyond the scope of existing techniques: using a training dataset composed of online product images where the underlying shape and pose are unknown.