Abstract
Diffusion models have made significant breakthroughs in image, audio, and video generation, but they depend on an iterative generation process that causes slow sampling speed and caps their potential for real-time applications. To overcome this limitation, we propose consistency models, a new family of generative models that achieve high sample quality without adversarial training. They support fast one-step generation by design, while still allowing for few-step sampling to trade compute for sample quality. They also support zero-shot data editing, like image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either as a way to distill pre-trained diffusion models, or as standalone generative models. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step generation. For example, we achieve the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained as standalone generative models, consistency models also outperform single-step, non-adversarial generative models on standard benchmarks like CIFAR-10, ImageNet 64x64, and LSUN 256x256.