Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes

Abstract

Whilst diffusion probabilistic models can generate high quality image content, key limitations remain in terms of both generating high-resolution imagery and their associated high computational requirements. Recent Vector-Quantized image models have overcome this limitation of image resolution but are prohibitively slow and unidirectional as they generate tokens via element-wise autoregressive sampling from the prior. By contrast, in this paper we propose a novel discrete diffusion probabilistic model prior which enables parallel prediction of Vector-Quantized tokens by using an unconstrained Transformer architecture as the backbone. During training, tokens are randomly masked in an order-agnostic manner and the Transformer learns to predict the original tokens. This parallelism of Vector-Quantized token prediction in turn facilitates unconditional generation of globally consistent high-resolution and diverse imagery at a fraction of the computational expense. In this manner, we can generate image resolutions exceeding that of the original training set samples whilst additionally provisioning per-image likelihood estimates (in a departure from generative adversarial approaches). Our approach achieves state-of-the-art results in terms of Density (LSUN Bedroom: 1.51; LSUN Churches: 1.12; FFHQ: 1.20) and Coverage (LSUN Bedroom: 0.83; LSUN Churches: 0.73; FFHQ: 0.80), and performs competitively on FID (LSUN Bedroom: 3.64; LSUN Churches: 4.07; FFHQ: 6.11) whilst offering advantages in terms of both computation and reduced training set requirements.
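
The training-time masking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `mask_tokens`, the linear masking probability `t / num_steps`, the codebook size, and the choice of reserving one extra index as the mask token are all assumptions made for illustration; the abstract only states that tokens are masked at random, independent of order, and that the Transformer is trained to recover the originals.

```python
import random


def mask_tokens(tokens, t, num_steps, mask_id, rng):
    """Corrupt a token sequence by independent random masking.

    Assumption (not stated in the abstract): at step t each token is
    replaced by mask_id with probability t / num_steps, regardless of
    its position in the sequence.
    """
    p = t / num_steps
    # One independent draw per token, so no ordering is imposed.
    is_masked = [rng.random() < p for _ in tokens]
    noisy = [mask_id if m else tok for tok, m in zip(tokens, is_masked)]
    return noisy, is_masked


# Hypothetical setup: a codebook of 1024 entries, with index 1024
# reserved as the mask (absorbing) token.
codebook_size = 1024
mask_id = codebook_size
rng = random.Random(0)

# A flattened 16 x 16 grid of Vector-Quantized codes (random stand-ins).
tokens = [rng.randrange(codebook_size) for _ in range(16 * 16)]

noisy, is_masked = mask_tokens(tokens, t=128, num_steps=256,
                               mask_id=mask_id, rng=rng)

# A denoising model would receive `noisy` and be trained to predict
# `tokens` at the positions where `is_masked` is True.
targets = [tok for tok, m in zip(tokens, is_masked) if m]
```

Because every position is masked independently, the model can be asked to predict all masked positions in a single forward pass, which is the source of the parallelism the abstract contrasts with element-wise autoregressive sampling.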
