SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation

Abstract

Despite their ability to generate high-resolution and diverse images from text prompts, text-to-image diffusion models often suffer from slow iterative sampling processes. Model distillation is one of the most effective directions to accelerate these models. However, previous distillation methods fail to retain the generation quality while requiring a significant amount of images for training, either from real data or synthetically generated by the teacher model. In response to this limitation, we present a novel image-free distillation scheme named $\textbf{SwiftBrush}$. Drawing inspiration from text-to-3D synthesis, in which a 3D neural radiance field that aligns with the input prompt can be obtained from a 2D text-to-image diffusion prior via a specialized loss without the use of any 3D data ground-truth, our approach re-purposes that same loss for distilling a pretrained multi-step text-to-image model to a student network that can generate high-fidelity images with just a single inference step. In spite of its simplicity, our model stands as one of the first one-step text-to-image generators that can produce images of comparable quality to Stable Diffusion without reliance on any training image data. Remarkably, SwiftBrush achieves an FID score of $\textbf{16.67}$ and a CLIP score of $\textbf{0.29}$ on the COCO-30K benchmark, achieving competitive results or even substantially surpassing existing state-of-the-art distillation techniques.
