MakeItTalk: Speaker-Aware Talking-Head Animation

Abstract

We present a method that generates expressive talking heads from a single facial image with audio as the only input. In contrast to previous approaches that attempt to learn direct mappings from audio to raw pixels or points for creating talking faces, our method first disentangles the content and speaker information in the input audio signal. The audio content robustly controls the motion of lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the rest of the talking-head dynamics. Another key component of our method is the prediction of facial landmarks reflecting speaker-aware dynamics. Based on this intermediate representation, our method is able to synthesize photorealistic videos of entire talking heads with a full range of motion and also animate artistic paintings, sketches, 2D cartoon characters, Japanese manga, and stylized caricatures in a single unified framework. We present extensive quantitative and qualitative evaluation of our method, in addition to user studies, demonstrating generated talking heads of significantly higher quality compared to the prior state of the art.
