Everybody's Talkin': Let Me Talk as You Want

Abstract

We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic. It does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output. Instead of learning a highly heterogeneous and nonlinear mapping from audio to video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, thereby preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.
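
The parameter factorization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `FaceParams` container, the function names, and the coefficient sizes are hypothetical, and the audio-derived expression values stand in for the output of the recurrent network, which is not modeled here. The sketch only shows the swap itself: each frame's expression parameters are replaced, while its geometry and pose are carried over unchanged.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple

Coeffs = Tuple[float, ...]


@dataclass(frozen=True)
class FaceParams:
    """Hypothetical per-frame parameters from monocular 3D face reconstruction."""
    expression: Coeffs  # changes with speech content
    geometry: Coeffs    # identity/shape of the target person
    pose: Coeffs        # head rotation and translation


def swap_expression(target: FaceParams, audio_expression: Coeffs) -> FaceParams:
    """Return a copy of `target` whose expression comes from the audio."""
    if len(audio_expression) != len(target.expression):
        raise ValueError("expression vectors must have the same length")
    # Geometry and pose are retained, so the original context is preserved.
    return replace(target, expression=tuple(audio_expression))


def retarget_sequence(
    target_frames: Sequence[FaceParams],
    audio_expressions: Sequence[Coeffs],
) -> List[FaceParams]:
    """Apply the expression swap frame by frame."""
    if len(target_frames) != len(audio_expressions):
        raise ValueError("need one audio-derived expression per target frame")
    return [swap_expression(f, e) for f, e in zip(target_frames, audio_expressions)]


# Toy data: two frames with 3 expression, 2 geometry, and 6 pose values
# (sizes chosen only for illustration).
frames = [
    FaceParams((0.0, 0.0, 0.0), (0.5, -0.2), (0.0, 0.1, 0.0, 0.0, 0.0, 1.0)),
    FaceParams((0.1, 0.0, 0.0), (0.5, -0.2), (0.0, 0.2, 0.0, 0.0, 0.0, 1.0)),
]
from_audio = [(0.3, 0.7, 0.1), (0.4, 0.6, 0.2)]  # placeholder network output
edited = retarget_sequence(frames, from_audio)
```

The edited parameters would then be passed to a renderer; the rendering network and the dynamic programming step mentioned in the abstract are outside the scope of this sketch.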
