Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary

Abstract

With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesizing video from text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) it needs only a fraction of the training data used by an audio-driven approach; 2) it is more flexible and is not vulnerable to speaker variation; 3) it significantly reduces preprocessing, training, and inference time. We perform extensive experiments to compare the proposed method with state-of-the-art talking-face generation methods on a benchmark dataset and on datasets of our own. The results demonstrate the effectiveness and superiority of our approach.
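
To illustrate the idea of a phoneme-pose dictionary with interpolated poses, the following is a minimal sketch and not the implementation used in the paper. It assumes that each phoneme maps to a single key pose stored as a flat list of coordinates, and that intermediate frames are obtained by linear interpolation; the abstract does not specify the interpolation scheme. The names `PHONEME_POSES`, `lerp`, and `interpolate_poses`, the ARPAbet-style phoneme labels, and the toy pose values are all hypothetical. The GAN that renders video frames from the poses is omitted.

```python
from typing import Dict, List

# A pose is a flat list of coordinates (hypothetical representation).
Pose = List[float]

# Hypothetical toy dictionary: phoneme label -> key pose.
# A real dictionary would hold full face/mouth landmark sets.
PHONEME_POSES: Dict[str, Pose] = {
    "M": [0.0, 0.0],
    "AA": [0.0, 1.0],
    "IY": [0.5, 0.2],
}


def lerp(a: Pose, b: Pose, t: float) -> Pose:
    """Linearly blend two poses; t=0 gives a, t=1 gives b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def interpolate_poses(phonemes: List[str], frames_between: int) -> List[Pose]:
    """Look up each phoneme's key pose and insert `frames_between`
    linearly interpolated poses between consecutive key poses.
    Linear interpolation is an assumption made for this sketch."""
    keys = [PHONEME_POSES[p] for p in phonemes]
    if not keys:
        return []
    out = [keys[0]]
    for prev, nxt in zip(keys, keys[1:]):
        for i in range(1, frames_between + 1):
            out.append(lerp(prev, nxt, i / (frames_between + 1)))
        out.append(nxt)
    return out


# Three key poses with one in-between frame per transition: five poses total.
sequence = interpolate_poses(["M", "AA", "IY"], frames_between=1)
```

In this sketch the resulting pose sequence would serve as the conditioning input to the generator; how the poses are timed against the speech and how they are rendered is left to the full method.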
