VideoGPT: Video Generation using VQ-VAE and Transformers

  • 2021-04-20 17:58:03
  • Wilson Yan, Yunzhi Zhang, Pieter Abbeel, Aravind Srinivas


We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and to generate high-fidelity natural videos from UCF-101 and the Tumblr GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at
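
To make the two-stage idea concrete, the following is a minimal NumPy sketch of the discretization step only: each latent vector is replaced by the index of its nearest codebook entry, and the resulting (time, height, width) grid of indices is flattened into a token sequence that an autoregressive model could consume. It is not the authors' implementation. The function names `quantize` and `to_sequence`, the codebook size, the latent shape, and the raster-scan ordering are all assumptions chosen for illustration, and the encoder (3D convolutions and axial self-attention) is replaced by random latents.

```python
import numpy as np

def quantize(latents, codebook):
    """Return, for every latent vector, the index of its nearest codebook entry.

    latents:  array of shape (T, H, W, D), a stand-in for encoder outputs
    codebook: array of shape (K, D)
    """
    flat = latents.reshape(-1, latents.shape[-1])  # (N, D)
    # Squared Euclidean distance between every latent and every code: (N, K)
    d2 = (
        (flat ** 2).sum(axis=1, keepdims=True)
        - 2.0 * flat @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    indices = d2.argmin(axis=1)
    return indices.reshape(latents.shape[:-1])  # (T, H, W)

def to_sequence(indices):
    """Flatten a (T, H, W) grid of code indices in raster-scan order (assumed)."""
    return indices.reshape(-1)

# Hypothetical toy sizes: 8 codes of dimension 4, a 2x3x3 latent grid.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
latents = rng.normal(size=(2, 3, 3, 4))

codes = quantize(latents, codebook)   # discrete latents, shape (2, 3, 3)
tokens = to_sequence(codes)           # token sequence, length 18
```

In a full pipeline, a GPT-like model would be trained to predict each entry of `tokens` from the preceding ones, with position encodings that carry the time, height, and width coordinates of each token.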

