Abstract
Conditional text-to-image generation has seen countless recent improvements in terms of quality, diversity and fidelity. Nevertheless, most state-of-the-art models require numerous inference steps to produce faithful generations, resulting in performance bottlenecks for end-user applications. In this paper we introduce Paella, a novel text-to-image model that requires fewer than 10 steps to sample high-fidelity images, using a speed-optimized architecture that samples a single image in less than 500 ms while having 573M parameters. The model operates on a compressed and quantized latent space, is conditioned on CLIP embeddings, and uses an improved sampling function over previous works. Aside from text-conditional image generation, our model is able to do latent space interpolation and image manipulations such as inpainting, outpainting, and structural editing. We release all of our code and pretrained models at https://github.com/dome272/Paella