Cost-Aware Routing for Efficient Text-To-Image Generation

Abstract

Diffusion models are well known for their ability to generate a high-fidelity image for an input prompt through an iterative denoising process. Unfortunately, the high fidelity also comes at a high computational cost due to the inherently sequential generative process. In this work, we seek to optimally balance quality and computational cost, and propose a framework to allow the amount of computation to vary for each prompt, depending on its complexity. Each prompt is automatically routed to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps of a diffusion model, or a disparate, independent text-to-image model. Unlike uniform cost reduction techniques (e.g., distillation, model quantization), our approach achieves the optimal trade-off by learning to reserve expensive choices (e.g., 100+ denoising steps) only for a few complex prompts, and employ more economical choices (e.g., small distilled model) for less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB that by learning to route to nine already-trained text-to-image models, our approach is able to deliver an average quality that is higher than that achievable by any of these models alone.
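To make the routing idea concrete, the following is a minimal sketch, not the method described in the paper. It assumes that a per-prompt quality estimate is already available for each candidate generation function, and it picks the candidate that maximizes estimated quality minus a cost penalty. The function `route`, the trade-off weight `lam`, the candidate names, their relative costs, and all quality numbers are hypothetical values chosen purely for illustration.

```python
from typing import Sequence


def route(pred_quality: Sequence[float], costs: Sequence[float], lam: float) -> int:
    """Return the index of the candidate with the best quality-minus-cost score.

    pred_quality[k] is an (assumed) estimate of the image quality candidate k
    would achieve on this prompt; costs[k] is its relative compute cost.
    lam >= 0 sets how strongly cost is penalized. This scoring rule is an
    illustrative assumption, not a rule confirmed by the abstract.
    """
    if len(pred_quality) != len(costs):
        raise ValueError("pred_quality and costs must have the same length")
    if not costs:
        raise ValueError("at least one candidate is required")
    scores = [q - lam * c for q, c in zip(pred_quality, costs)]
    # max() returns the first maximal index, so ties go to the earlier candidate.
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical candidates: (name, relative cost). Not taken from the paper.
CANDIDATES = [
    ("distilled-4-steps", 1.0),
    ("base-25-steps", 6.0),
    ("base-100-steps", 25.0),
]
COSTS = [cost for _, cost in CANDIDATES]

# Made-up quality estimates for two prompts.
simple_prompt_quality = [0.80, 0.82, 0.83]   # extra steps barely help
complex_prompt_quality = [0.40, 0.65, 0.90]  # extra steps help a lot

if __name__ == "__main__":
    for label, quality in [("simple", simple_prompt_quality),
                           ("complex", complex_prompt_quality)]:
        choice = route(quality, COSTS, lam=0.01)
        print(label, "->", CANDIDATES[choice][0])
```

With these illustrative numbers, the simple prompt goes to the cheapest candidate and the complex prompt to the most expensive one, mirroring the intent of reserving costly choices for prompts that benefit from them. Setting `lam` to 0 always selects the highest estimated quality regardless of cost, while a large `lam` always selects the cheapest candidate.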
