Abstract
Extending image generation to video generation is a very difficult task, since the temporal dimension of videos introduces an additional challenge during the generation process. Moreover, because of memory limits and training instability, generation becomes increasingly challenging as the resolution or duration of the videos grows. In this work, we exploit the idea of progressive growing of Generative Adversarial Networks (GANs) for higher-resolution video generation. Specifically, we start by producing low-resolution, short-duration video samples, and then progressively increase resolution and duration, separately or jointly, by adding new spatiotemporal convolutional layers to the current networks. Starting from a coarse-level spatial appearance and temporal movement of the video distribution, the proposed progressive method learns spatiotemporal information incrementally to generate higher-resolution videos. Furthermore, we introduce a sliced version of the Wasserstein GAN (SWGAN) loss to improve distribution learning on high-dimensional video data with a mixed spatiotemporal distribution. The SWGAN loss replaces the distance between joint distributions with that between one-dimensional marginal distributions, making the loss easier to compute. We evaluate the proposed model on our collected face video dataset of 10,900 videos, generating photorealistic face videos at 256x256x32 resolution. In addition, our model reaches a record inception score of 14.57 on the UCF-101 action recognition dataset in the unsupervised setting.
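To make the slicing idea concrete, the following is a minimal sketch of a sliced Wasserstein distance between two equally sized sets of feature vectors. It is an illustration under stated assumptions, not the SWGAN loss as used in this work: the function names, the use of random unit directions, the squared 2-Wasserstein cost, and the number of projections are all choices made for this example. Each direction reduces the samples to a one-dimensional marginal, where optimal transport amounts to matching sorted values.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere in `dim` dimensions."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]


def sliced_wasserstein_sq(xs, ys, num_projections=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    `xs` and `ys` are equally sized lists of feature vectors (lists of
    floats). Random projection directions are an assumption of this sketch.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized sample sets")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        theta = random_unit_vector(dim, rng)
        # Project every sample onto the direction: a 1-D marginal.
        px = sorted(sum(a * t for a, t in zip(x, theta)) for x in xs)
        py = sorted(sum(a * t for a, t in zip(y, theta)) for y in ys)
        # In one dimension, optimal transport matches sorted samples.
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / num_projections
```

The cost per direction is one sort of each sample set, which is what makes the sliced distance cheaper to evaluate than a distance defined on the joint distribution.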