VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer

Abstract

Current talking face generation methods mainly focus on speech-lip synchronization. However, insufficient investigation of facial talking style leads to lifeless and monotonous avatars. Most previous works fail to imitate expressive styles from arbitrary video prompts or to ensure the authenticity of the generated video. This paper proposes an unsupervised variational style transfer model (VAST) to vivify neutral photo-realistic avatars. Our model consists of three key components: a style encoder that extracts facial style representations from the given video prompts; a hybrid facial expression decoder that models accurate speech-related movements; and a variational style enhancer that makes the style space highly expressive and meaningful. With these designs for facial style learning, our model can flexibly capture the expressive facial style from arbitrary video prompts and transfer it onto a personalized image renderer in a zero-shot manner. Experimental results demonstrate that the proposed approach yields a more vivid talking avatar with higher authenticity and richer expressiveness.
