SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Abstract

Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than $2$ seconds. We achieve this by introducing an efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model and reducing the computation of the image decoder via data distillation. Further, we enhance the step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with $8$ denoising steps achieves better FID and CLIP scores than Stable Diffusion v$1.5$ with $50$ steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models into the hands of users.
