Abstract
Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Visual-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preservation generation tasks, without requiring test-time optimization.