Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models

Abstract

The quality of the prompts provided to text-to-image diffusion models determines how faithful the generated content is to the user's intent, often requiring "prompt engineering". To harness visual concepts from target images without prompt engineering, current approaches largely rely on embedding inversion by optimizing and then mapping them to pseudo-tokens. However, working with such high-dimensional vector representations is challenging because they lack semantics and interpretability, and only allow simple vector operations when using them. Instead, this work focuses on inverting the diffusion model to obtain interpretable language prompts directly. The challenge of doing this lies in the fact that the resulting optimization problem is fundamentally discrete and the space of prompts is exponentially large; this makes using standard optimization techniques, such as stochastic gradient descent, difficult. To this end, we utilize a delayed projection scheme to optimize for prompts representative of the vocabulary space in the model. Further, we leverage the findings that different timesteps of the diffusion process cater to different levels of detail in an image. The later, noisy, timesteps of the forward diffusion process correspond to the semantic information, and therefore, prompt inversion in this range provides tokens representative of the image semantics. We show that our approach can identify semantically interpretable and meaningful prompts for a target image which can be used to synthesize diverse images with similar content. We further illustrate the application of the optimized prompts in evolutionary image generation and concept removal.
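As a rough illustration of the idea of optimizing in a continuous embedding space while periodically snapping back to the vocabulary, the toy sketch below may help. It is not the paper's implementation: the abstract does not define the delayed projection scheme in detail, so the sketch assumes one simple reading (take several gradient steps, then project onto the nearest vocabulary embedding). The 2-D vocabulary, the quadratic loss standing in for the diffusion objective, and the names `VOCAB`, `nearest_token`, `toy_loss_grad`, and `invert` are all hypothetical.

```python
# Toy vocabulary: token -> 2-D embedding (hypothetical values, not from the paper).
VOCAB = {
    "cat": (1.0, 0.0),
    "dog": (0.0, 1.0),
    "car": (-1.0, 0.0),
    "tree": (0.0, -1.0),
}


def nearest_token(vec):
    """Return the vocabulary token whose embedding is closest to vec (squared L2)."""
    return min(
        VOCAB,
        key=lambda tok: sum((a - b) ** 2 for a, b in zip(VOCAB[tok], vec)),
    )


def toy_loss_grad(vec, target):
    """Gradient of ||vec - target||^2, a stand-in for the real diffusion loss."""
    return tuple(2.0 * (a - b) for a, b in zip(vec, target))


def invert(target, steps=20, project_every=5, lr=0.1, start=(0.0, 0.0)):
    """Optimize a continuous embedding, projecting to the vocabulary only
    every `project_every` steps (the assumed 'delayed' projection)."""
    vec = start
    for step in range(1, steps + 1):
        grad = toy_loss_grad(vec, target)
        # Plain gradient descent step in the continuous space.
        vec = tuple(v - lr * g for v, g in zip(vec, grad))
        if step % project_every == 0:
            # Snap back onto the discrete vocabulary.
            vec = VOCAB[nearest_token(vec)]
    # Final projection yields a readable token.
    return nearest_token(vec)


if __name__ == "__main__":
    print(invert((0.9, 0.2)))
    print(invert((0.1, 0.8)))
```

In a real setting the embedding would be a sequence of high-dimensional token embeddings and the loss would come from the diffusion model, restricted to the timestep range of interest; the sketch only shows the interplay between continuous updates and discrete projection.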
