InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation

Abstract

Diffusion models have revolutionized text-to-image generation with their exceptional quality and creativity. However, their multi-step sampling process is known to be slow, often requiring tens of inference steps to obtain satisfactory results. Previous attempts to improve sampling speed and reduce computational cost through distillation have not produced a functional one-step model. In this paper, we explore a recent method called Rectified Flow, which, thus far, has only been applied to small datasets. The core of Rectified Flow lies in its \emph{reflow} procedure, which straightens the trajectories of probability flows, refines the coupling between noises and images, and facilitates the distillation process with student models. We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find reflow plays a critical role in improving the assignment between noise and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality, achieving an FID (Fréchet Inception Distance) of $23.3$ on MS COCO 2017-5k, surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin ($37.2$ $\rightarrow$ $23.3$ in FID). By utilizing an expanded network with 1.7B parameters, we further improve the FID to $22.4$. We call our one-step models \emph{InstaFlow}. On MS COCO 2014-30k, InstaFlow yields an FID of $13.1$ in just $0.09$ seconds, the best in the $\leq 0.1$ second regime, outperforming the recent StyleGAN-T ($13.9$ in $0.1$ seconds). Notably, the training of InstaFlow only costs 199 A100 GPU days. Project page:~\url{https://github.com/gnobitab/InstaFlow}.
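
To make the role of reflow concrete, here is a minimal NumPy sketch. It assumes the standard Rectified Flow formulation: a noise sample and an image are joined by a straight line, and a velocity field is regressed onto their difference. This is a toy illustration, not the InstaFlow training code. All function names are hypothetical, the velocity field is a plain Python callable rather than the Stable Diffusion network, and text conditioning is omitted.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to image x1 (t=1)."""
    return t * x1 + (1.0 - t) * x0

def flow_matching_loss(velocity, x0, x1, t):
    """MSE between the predicted velocity at X_t and the straight-line target X1 - X0."""
    xt = interpolate(x0, x1, t)
    target = x1 - x0
    return float(np.mean((velocity(xt, t) - target) ** 2))

def euler_sample(velocity, x0, n_steps):
    """Integrate dX/dt = velocity(X, t) from t=0 to t=1 with n_steps Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt)
    return x

def make_reflow_pairs(velocity, x0, n_steps=25):
    """Reflow data: pair each noise sample with the endpoint the current flow maps it to.

    Retraining on these (noise, endpoint) pairs is what straightens the trajectories.
    """
    return x0, euler_sample(velocity, x0, n_steps)

# Hypothetical toy velocity fields, for illustration only.
straight_flow = lambda x, t: np.full_like(x, 2.0)  # constant velocity: straight paths
curved_flow = lambda x, t: x                       # state-dependent velocity: curved paths

noise = np.ones((4, 2))
one_step_straight = euler_sample(straight_flow, noise, 1)
many_step_straight = euler_sample(straight_flow, noise, 25)
one_step_curved = euler_sample(curved_flow, noise, 1)
many_step_curved = euler_sample(curved_flow, noise, 100)
```

With the straight toy flow, a single Euler step lands on the same endpoint as 25 steps. With the curved one, the one-step result drifts away from the many-step result. That gap is what straightening the trajectories is meant to close before distilling into a one-step student.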
