Abstract
We consider the problem of learning a function that can estimate the 3D shape, articulation, viewpoint, texture, and lighting of an articulated animal like a horse, given a single test image. We present a new method, dubbed MagicPony, that learns this function purely from in-the-wild single-view images of the object category, with minimal assumptions about the topology of deformation. At its core is an implicit-explicit representation of articulated shape and appearance, combining the strengths of neural fields and meshes. In order to help the model understand an object's shape and pose, we distil the knowledge captured by an off-the-shelf self-supervised vision transformer and fuse it into the 3D model. To overcome common local optima in viewpoint estimation, we further introduce a new viewpoint sampling scheme that comes at no added training cost. Compared to prior works, we show significant quantitative and qualitative improvements on this challenging task. The model also demonstrates excellent generalisation in reconstructing abstract drawings and artefacts, despite the fact that it is only trained on real images.