MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement

Abstract

This paper presents a generic method for generating full facial 3D animation from speech. Existing approaches to audio-driven facial animation exhibit uncanny or static upper face animation, fail to produce accurate and plausible co-articulation, or rely on person-specific models that limit their scalability. To improve upon existing models, we propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face. At the core of our approach is a categorical latent space for facial animation that disentangles audio-correlated and audio-uncorrelated information based on a novel cross-modality loss. Our approach ensures highly accurate lip motion, while also synthesizing plausible animation of the parts of the face that are uncorrelated to the audio signal, such as eye blinks and eyebrow motion. We demonstrate that our approach outperforms several baselines and obtains state-of-the-art quality both qualitatively and quantitatively. A perceptual user study demonstrates that our approach is deemed more realistic than the current state-of-the-art in over 75% of cases. We recommend watching the supplemental video before reading the paper: https://research.fb.com/wp-content/uploads/2021/04/mesh_talk.mp4
