Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces

  • 2022-11-14 11:52:55
  • Dominic Rampas, Pablo Pernias, Elea Zhong, Marc Aubreville
  • 72


Conditional text-to-image generation has seen countless recent improvements in terms of quality, diversity and fidelity. Nevertheless, most state-of-the-art models require numerous inference steps to produce faithful generations, resulting in performance bottlenecks for end-user applications. In this paper we introduce Paella, a novel text-to-image model requiring fewer than 10 steps to sample high-fidelity images, using a speed-optimized architecture that can sample a single image in less than 500 ms while having 573M parameters. The model operates on a compressed and quantized latent space, is conditioned on CLIP embeddings, and uses an improved sampling function over previous works. Aside from text-conditional image generation, our model is able to do latent space interpolation and image manipulations such as inpainting, outpainting, and structural editing. We release all of our code and pretrained models at

