Abstract
We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high fidelity natural images from UCF-101 and the Tumblr GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html