Intriguing Properties of Text-guided Diffusion Models

Abstract

Text-guided diffusion models (TDMs) are widely applied but can fail unexpectedly. Common failures include: (i) natural-looking text prompts generating images with the wrong content, or (ii) different random samples of the latent variables that generate vastly different, and even unrelated, outputs despite being conditioned on the same text prompt. In this work, we aim to study and understand the failure modes of TDMs in more detail. To achieve this, we propose SAGE, an adversarial attack on TDMs that uses image classifiers as surrogate loss functions, to search over the discrete prompt space and the high-dimensional latent space of TDMs to automatically discover unexpected behaviors and failure cases in the image generation. We make several technical contributions to ensure that SAGE finds failure cases of the diffusion model, rather than the classifier, and verify this in a human study. Our study reveals four intriguing properties of TDMs that have not been systematically studied before: (1) We find a variety of natural text prompts producing images that fail to capture the semantics of input texts. We categorize these failures into ten distinct types based on the underlying causes. (2) We find samples in the latent space (which are not outliers) that lead to distorted images independent of the text prompt, suggesting that parts of the latent space are not well-structured. (3) We also find latent samples that lead to natural-looking images which are unrelated to the text prompt, implying a potential misalignment between the latent and prompt spaces. (4) By appending a single adversarial token embedding to an input prompt we can generate a variety of specified target objects, while only minimally affecting the CLIP score. This demonstrates the fragility of language representations and raises potential safety concerns.
