Learning deformable 3D objects from 2D images is an extremely ill-posed problem. Existing methods rely on explicit supervision to establish multi-view correspondences, such as template shape models and keypoint annotations, which restricts their applicability to objects "in the wild". In this paper, we propose to use monocular videos, which naturally provide correspondences across time, allowing us to learn 3D shapes of deformable object categories without explicit keypoints or template shapes. Specifically, we present DOVE, which learns to predict 3D canonical shape, deformation, viewpoint and texture from a single 2D image of a bird, given a bird video collection as well as automatically obtained silhouettes and optical flows as training data. Our method reconstructs temporally consistent 3D shape and deformation, which allows us to animate the bird and re-render it from arbitrary viewpoints given a single image.