Abstract
We present a learning-based model to infer the personalized 3D shape of people from a few frames (1-8) of a monocular video in which the person is moving, in less than 10 seconds and with a reconstruction accuracy of 5 mm. Our model learns to predict the parameters of a statistical body model and instance displacements that add clothing and hair to the shape. The model achieves fast and accurate predictions based on two key design choices. First, by predicting shape in a canonical T-pose space, the network learns to encode the images of the person into pose-invariant latent codes, where the information is fused. Second, based on the observation that feed-forward predictions are fast but do not always align with the input images, we predict using both bottom-up and top-down streams (one per view), allowing information to flow in both directions. Learning relies only on synthetic 3D data. Once learned, the model can take a variable number of frames as input and is able to reconstruct shapes even from a single image with an accuracy of 6 mm. Results on three different datasets demonstrate the efficacy and accuracy of our approach.
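
To illustrate how pose-invariant latent codes could let the model accept a variable number of frames, the following minimal sketch fuses per-frame codes by element-wise averaging. The averaging operator and the function name `fuse_latent_codes` are assumptions made for illustration; the abstract does not specify how the codes are fused.

```python
from typing import List, Sequence


def fuse_latent_codes(codes: Sequence[Sequence[float]]) -> List[float]:
    """Fuse per-frame latent codes into one fixed-size code.

    Assumption: fusion is an element-wise mean. Any permutation-invariant
    reduction would likewise accept a variable number of frames.
    """
    if not codes:
        raise ValueError("need at least one frame")
    dim = len(codes[0])
    if any(len(c) != dim for c in codes):
        raise ValueError("all codes must have the same dimension")
    n = len(codes)
    # Average each latent dimension over the available frames.
    return [sum(c[i] for c in codes) / n for i in range(dim)]


# One frame or several frames yield a code of the same size.
single = fuse_latent_codes([[0.2, 0.4]])
multi = fuse_latent_codes([[0.0, 1.0], [1.0, 3.0]])
```

Because the mean does not depend on the order or the number of inputs, the downstream shape predictor in this sketch always receives a code of the same dimension, whether one frame or eight are supplied.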