FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features

Abstract

The task of face reenactment is to transfer the head motion and facial expressions from a driving video to the appearance of a source image, which may be of a different person (cross-reenactment). Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame, which is then inpainted and refined to produce the output animation. We propose a transformer-based encoder for computing a set-latent representation of the source image(s). We then predict the output color of a query pixel using a transformer-based decoder, which is conditioned with keypoints and a facial expression vector extracted from the driving frame. Latent representations of the source person are learned in a self-supervised manner that factorize their appearance, head pose, and facial expressions. Thus, they are perfectly suited for cross-reenactment. In contrast to most related work, our method naturally extends to multiple source images and can thus adapt to person-specific facial dynamics. We also propose data augmentation and regularization schemes that are necessary to prevent overfitting and support generalizability of the learned representations. We evaluated our approach in a randomized user study. The results indicate superior performance compared to the state-of-the-art in terms of motion transfer quality and temporal consistency.
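
To make the decoding step described above more concrete, the following is a minimal sketch of how a single query pixel could attend to a set of source latents while being conditioned on driving keypoints and an expression vector. It is not the authors' architecture: it uses one single-head cross-attention layer, random weights in place of learned ones, and hypothetical names and sizes throughout (`decode_pixel`, `D_LATENT`, `D_MODEL`, `N_KP`, `D_EXPR`).

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random stand-in for learned projection weights (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_pixel(query_xy, keypoints, expression, set_latents, params):
    """Single-head cross-attention from one query pixel to the source set-latents."""
    # Conditioning vector: pixel location, driving keypoints, driving expression.
    cond = list(query_xy) + [c for kp in keypoints for c in kp] + list(expression)
    q = matvec(params["Wq"], cond)
    keys = [matvec(params["Wk"], z) for z in set_latents]
    values = [matvec(params["Wv"], z) for z in set_latents]
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    # Attention-weighted sum of the value vectors.
    attended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(q))]
    # Map to three channels and squash into (0, 1) as an RGB color.
    rgb = [1.0 / (1.0 + math.exp(-c)) for c in matvec(params["Wo"], attended)]
    return rgb, weights


rng = random.Random(0)
D_LATENT, D_MODEL, N_KP, D_EXPR = 8, 4, 3, 2  # hypothetical sizes
cond_dim = 2 + 2 * N_KP + D_EXPR
params = {
    "Wq": rand_matrix(D_MODEL, cond_dim, rng),
    "Wk": rand_matrix(D_MODEL, D_LATENT, rng),
    "Wv": rand_matrix(D_MODEL, D_LATENT, rng),
    "Wo": rand_matrix(3, D_MODEL, rng),
}
# Five latents standing in for the encoder output over the source image(s).
set_latents = [[rng.uniform(-1, 1) for _ in range(D_LATENT)] for _ in range(5)]
keypoints = [(0.1, 0.2), (-0.3, 0.4), (0.0, -0.5)]
expression = [0.7, -0.2]

rgb, weights = decode_pixel((0.25, -0.25), keypoints, expression, set_latents, params)
print(rgb, weights)
```

Because the decoder attends over a set, latents from additional source images could simply be appended to `set_latents` without changing `decode_pixel`, which is one way to read the claim that the method extends to multiple source images.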
