Abstract
Knowledge distillation methods have recently been shown to be a promising direction for speeding up the synthesis of large-scale diffusion models by requiring only a few inference steps. While several powerful distillation methods were recently proposed, the overall quality of student samples is typically lower than that of the teacher samples, which hinders their practical usage. In this work, we investigate the relative quality of samples produced by the teacher text-to-image diffusion model and its distilled student version. As our main empirical finding, we discover that a noticeable portion of student samples exhibit superior fidelity compared to the teacher ones, despite the ``approximate'' nature of the student. Based on this finding, we propose an adaptive collaboration between student and teacher diffusion models for effective text-to-image synthesis. Specifically, the distilled model produces the initial sample, and then an oracle decides whether it needs further improvement by the slower teacher model. Extensive experiments demonstrate that the designed pipeline surpasses state-of-the-art text-to-image alternatives for various inference budgets in terms of human preference. Furthermore, the proposed approach can be naturally used in popular applications such as text-guided image editing and controllable generation.
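The adaptive collaboration described above can be illustrated with a minimal control-flow sketch. It is not the implementation used in this work: the names `adaptive_generate`, `student`, `teacher_refine`, `oracle_score`, and `threshold` are hypothetical, the oracle is assumed to return a scalar score that is compared against a cutoff, and the toy stand-ins operate on strings rather than images.

```python
from typing import Callable, Tuple, TypeVar

Sample = TypeVar("Sample")


def adaptive_generate(
    prompt: str,
    student: Callable[[str], Sample],
    teacher_refine: Callable[[str, Sample], Sample],
    oracle_score: Callable[[str, Sample], float],
    threshold: float,
) -> Tuple[Sample, bool]:
    """Return (sample, used_teacher) for one prompt.

    All arguments are hypothetical placeholders: `student` is the fast
    distilled model, `teacher_refine` is the slow teacher improving a draft,
    and `oracle_score` rates the draft (higher is assumed to be better).
    """
    draft = student(prompt)  # cheap few-step sample
    if oracle_score(prompt, draft) >= threshold:
        return draft, False  # oracle accepts the student sample as is
    # Oracle rejects the draft, so spend extra compute on the teacher.
    return teacher_refine(prompt, draft), True


# Toy stand-ins, purely to exercise the control flow.
def toy_student(prompt: str) -> str:
    return f"draft({prompt})"


def toy_teacher_refine(prompt: str, draft: str) -> str:
    return f"refined({draft})"


def toy_oracle_score(prompt: str, sample: str) -> float:
    # Assumption for the demo only: short prompts are "easy" for the student.
    return 1.0 if len(prompt) < 10 else 0.0
```

Under these assumptions, raising `threshold` sends more drafts to the teacher, which is one way such a pipeline could trade quality against the inference budget.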