Abstract
Acquisition and rendering of photo-realistic human heads is a highly challenging research problem of particular importance for virtual telepresence. Currently, the highest quality is achieved by volumetric approaches trained in a person-specific manner on multi-view data. These models represent fine structure, such as hair, better than simpler mesh-based models. Volumetric models typically employ a global code to represent facial expressions, so that they can be driven by a small set of animation parameters. While such architectures achieve impressive rendering quality, they cannot easily be extended to the multi-identity setting. In this paper, we devise a novel approach for predicting volumetric avatars of the human head given just a small number of inputs. We enable generalization across identities through a novel parameterization that combines neural radiance fields with local, pixel-aligned features extracted directly from the inputs, thus sidestepping the need for very deep or complex networks. Our approach is trained end-to-end, solely with a photometric re-rendering loss and without requiring explicit 3D supervision. We demonstrate that our approach outperforms the existing state of the art in terms of quality and is able to generate faithful facial expressions in a multi-identity setting.
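The abstract does not say how the pixel-aligned features are obtained. The sketch below shows one common project-and-sample construction, offered only as an illustration under stated assumptions: each 3D query point is projected into every input view with a pinhole camera (intrinsics `K`, rotation `R`, translation `t`), a per-view feature map is bilinearly sampled at the projected location, the per-view samples are mean-pooled, and the result is concatenated with a sinusoidal positional encoding of the point. The function names, the mean-pooling across views, and the encoding frequencies are hypothetical choices, not the paper's; in a NeRF-style model the returned vector would be fed to an MLP that predicts density and color.

```python
import numpy as np


def project_points(points, K, R, t):
    """Project world-space points (N, 3) to pixel coordinates (N, 2) with a pinhole camera."""
    cam = points @ R.T + t            # world -> camera coordinates
    uvw = cam @ K.T                   # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]   # perspective divide


def bilinear_sample(feat, uv):
    """Bilinearly sample feature map feat (H, W, C) at continuous pixel coords uv (N, 2)."""
    H, W, _ = feat.shape
    # Clamp to the image so out-of-view points read the nearest border feature.
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u1] * du
    bottom = feat[v1, u0] * (1 - du) + feat[v1, u1] * du
    return top * (1 - dv) + bottom * dv


def positional_encoding(x, num_freqs=4):
    """NeRF-style sinusoidal encoding; returns x concatenated with sin/cos features."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = x[..., None] * freqs                                   # (N, 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (N, 3, 2F)
    return np.concatenate([x, enc.reshape(x.shape[0], -1)], axis=-1)


def pixel_aligned_features(points, feature_maps, cameras):
    """Mean-pool pixel-aligned features over input views (hypothetical fusion rule)."""
    per_view = [
        bilinear_sample(feat, project_points(points, K, R, t))
        for feat, (K, R, t) in zip(feature_maps, cameras)
    ]
    fused = np.mean(per_view, axis=0)                               # (N, C)
    return np.concatenate([positional_encoding(points), fused], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = np.array([[16.0, 0.0, 16.0], [0.0, 16.0, 16.0], [0.0, 0.0, 1.0]])
    cams = [(K, np.eye(3), np.array([0.0, 0.0, 2.0]))] * 2
    maps = [rng.standard_normal((32, 32, 8)) for _ in range(2)]
    pts = rng.uniform(-0.5, 0.5, size=(5, 3))
    print(pixel_aligned_features(pts, maps, cams).shape)
```

Because the features are looked up locally per point rather than decoded from a single global code, the same network weights can, in principle, be applied to a new identity simply by supplying that identity's input images; this is the intuition behind the abstract's claim that the parameterization avoids very deep or complex networks.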