Megapixel Image Generation with Step-Unrolled Denoising Autoencoders

Abstract

An ongoing trend in generative modelling research has been to push sample resolutions higher whilst simultaneously reducing computational requirements for training and sampling. We aim to push this trend further via a combination of techniques, each component representing the current pinnacle of efficiency in its respective area. These include vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy - but perceptually insignificant - compression; hourglass transformers, a highly scalable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model. Unexpectedly, our method highlights weaknesses in the original formulation of hourglass transformers when applied to multidimensional data. In light of this, we propose modifications to the resampling mechanism, applicable in any task applying hierarchical transformers to multidimensional data. Additionally, we demonstrate the scalability of SUNDAE to long sequence lengths - four times longer than prior work. Our proposed framework scales to high resolutions ($1024 \times 1024$) and trains quickly (2-4 days). Crucially, the trained model produces diverse and realistic megapixel samples in approximately 2 seconds on a consumer-grade GPU (GTX 1080Ti). In general, the framework is flexible: it supports an arbitrary number of sampling steps, sample-wise self-stopping, self-correction capabilities, conditional generation, and a NAR formulation that allows for arbitrary inpainting masks. We obtain FID scores of 10.56 on FFHQ256 - close to the original VQ-GAN in less than half the sampling steps - and 21.85 on FFHQ1024 in only 100 sampling steps.
