Abstract
Animating portraits using speech has received growing attention in recent years, with various creative and practical use cases. An ideal generated video should have good lip sync with the audio, natural facial expressions and head motions, and high frame quality. In this work, we present SPACEx, which uses speech and a single image to generate high-resolution, expressive videos with realistic head pose, without requiring a driving video. It uses a multi-stage approach, combining the controllability of facial landmarks with the high-quality synthesis power of a pretrained face generator. SPACEx also allows for the control of emotions and their intensities. Our method outperforms prior methods in objective metrics for image quality and facial motions and is strongly preferred by users in pairwise comparisons. The project website is available at https://deepimagination.cc/SPACEx/