Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

Abstract

We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image. In this work, we tackle two key challenges: (i) producing natural head motions that match speech prosody, and (ii) maintaining the appearance of a speaker under large head motions while stabilizing the non-face regions. We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN). In this way, the predicted head poses act as the low-frequency holistic movements of a talking head, thus allowing our subsequent network to focus on detailed facial movement generation. To depict the entire image motions arising from audio, we exploit a keypoint-based dense motion field representation. Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image. As this keypoint-based representation models the motions of facial regions, head, and backgrounds integrally, our method can better constrain the spatial and temporal consistency of the generated videos. Finally, an image generation network is employed to render photo-realistic talking-head videos from the estimated keypoint-based motion fields and the input reference image. Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds, and that it outperforms the state of the art.
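
To make the idea of a keypoint-based dense motion field concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes that each keypoint contributes its displacement to nearby pixels through a Gaussian weight, and that a small constant background weight with zero displacement keeps distant pixels still. The function names, `sigma`, and `bg_weight` are hypothetical, and the learned networks that would predict the keypoints are omitted.

```python
import math


def pixel_coord(index, size):
    """Map a pixel index to a normalized coordinate in [-1, 1]."""
    return 0.0 if size == 1 else -1.0 + 2.0 * index / (size - 1)


def dense_motion_field(ref_kps, drv_kps, height, width, sigma=0.1, bg_weight=1e-3):
    """Blend sparse keypoint displacements into one flow vector per pixel.

    ref_kps / drv_kps: lists of (x, y) keypoints in the reference image and
    in the frame being generated, in normalized coordinates.
    Returns field[row][col] = (fx, fy), a backward flow: adding it to a
    pixel's coordinate gives the location to sample in the reference image.
    """
    field = []
    for row in range(height):
        y = pixel_coord(row, height)
        line = []
        for col in range(width):
            x = pixel_coord(col, width)
            # Assumed background term: zero displacement, constant weight.
            total = bg_weight
            fx = fy = 0.0
            for (rx, ry), (dx, dy) in zip(ref_kps, drv_kps):
                # Gaussian weight centered on the keypoint in the generated frame.
                dist_sq = (x - dx) ** 2 + (y - dy) ** 2
                w = math.exp(-dist_sq / (2.0 * sigma ** 2))
                total += w
                fx += w * (rx - dx)
                fy += w * (ry - dy)
            line.append((fx / total, fy / total))
        field.append(line)
    return field


# One keypoint that moved from the image center to the right.
field = dense_motion_field([(0.0, 0.0)], [(0.5, 0.0)], height=5, width=5)
```

In this toy setting, the pixel at the moved keypoint receives a flow pointing back toward the image center, while a corner pixel far from any keypoint receives almost no flow. This mirrors, in a very simplified form, how a single representation can cover facial regions, head, and background together.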
