With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesizing video from text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) it needs only a fraction of the training data used by an audio-driven approach; 2) it is more flexible and is not vulnerable to speaker variation; 3) it significantly reduces preprocessing, training, and inference time. We perform extensive experiments to compare the proposed method with state-of-the-art talking face generation methods on a benchmark dataset and datasets of our own. The results demonstrate the effectiveness and superiority of our approach.