What the DAAM: Interpreting Stable Diffusion Using Cross Attention

Abstract

Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, with some performing similar to real photographs in human evaluation. However, they remain poorly understood, lacking explainability and interpretability analyses, largely due to their proprietary, closed-source nature. In this paper, to shine some much-needed light on text-to-image diffusion models, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced large diffusion model. To produce pixel-level attribution maps, we propose DAAM, a novel method based on upscaling and aggregating cross-attention activations in the latent denoising subnetwork. We support its correctness by evaluating its unsupervised instance segmentation quality on its own generated imagery, compared to supervised segmentation models. We show that DAAM performs strongly on COCO caption-generated images, achieving an average precision (AP) of 61.0, and it outperforms supervised models on full-vocabulary segmentation, for an AP of 51.5. We further find that certain parts of speech, like punctuation and conjunctions, influence the generated imagery most, which agrees with the prior literature, while determiners and numerals the least, suggesting poor numeracy. To our knowledge, we are the first to propose and study word-pixel attribution for large-scale text-to-image diffusion models. Our code and data are at https://github.com/castorini/daam
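
The core idea named in the abstract, upscaling cross-attention activations to a common resolution and aggregating them into a per-word heat map, can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository above for that). The function names, the (heads, height, width, tokens) array layout, and the use of nearest-neighbor upscaling with a plain sum are all assumptions made for illustration; the abstract does not specify the interpolation or aggregation details.

```python
import numpy as np


def upscale_nearest(att, size):
    """Upscale an (h, w) map to (size, size) by integer-factor repetition.

    Nearest-neighbor upscaling is an assumption chosen for simplicity.
    """
    h, w = att.shape
    if size % h != 0 or size % w != 0:
        raise ValueError("target size must be a multiple of the map size")
    return np.repeat(np.repeat(att, size // h, axis=0), size // w, axis=1)


def aggregate_heat_map(attention_maps, token_index, size=64):
    """Sum upscaled cross-attention maps for a single prompt token.

    attention_maps: iterable of arrays with the hypothetical layout
        (heads, h, w, tokens), e.g. one array per layer and time step.
    token_index: position of the word of interest in the prompt.
    """
    heat = np.zeros((size, size), dtype=np.float64)
    for att in attention_maps:
        # Select the token's slice: shape (heads, h, w).
        for head_map in att[..., token_index]:
            heat += upscale_nearest(head_map, size)
    return heat
```

For example, passing maps collected at 16x16 and 32x32 resolution with size=64 yields a single 64x64 array per word, where larger values mark the regions that attended most strongly to that word.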
