Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP

Abstract

Recent work has explored how individual components of the CLIP-ViT model contribute to the final representation by leveraging the shared image-text representation space of CLIP. These components, such as attention heads and MLPs, have been shown to capture distinct image features like shape, color or texture. However, understanding the role of these components in arbitrary vision transformers (ViTs) is challenging. To this end, we introduce a general framework which can identify the roles of various components in ViTs beyond CLIP. Specifically, we (a) automate the decomposition of the final representation into contributions from different model components, and (b) linearly map these contributions to CLIP space to interpret them via text. Additionally, we introduce a novel scoring function to rank components by their importance with respect to specific features. Applying our framework to various ViT variants (e.g. DeiT, DINO, DINOv2, Swin, MaxViT), we gain insights into the roles of different components concerning particular image features. These insights facilitate applications such as image retrieval using text descriptions or reference images, visualizing token importance heatmaps, and mitigating spurious correlations. We release our code to reproduce the experiments at https://github.com/SriramB-98/vit-decompose
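
To make steps (a) and (b) concrete, the following is a minimal sketch on synthetic data, not the released implementation. It assumes the final representation is an exact sum of per-component contributions, fits a bias-free linear map to stand-in "CLIP" embeddings by least squares, and ranks components with a simple illustrative score (mean absolute alignment with a text direction). All names, shapes, and the score itself are hypothetical; the paper's actual scoring function is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_components, d_vit, d_clip = 200, 6, 32, 16  # hypothetical sizes

# (a) Hypothetical decomposition: contribs[i, k] is the additive contribution
# of component k (e.g. an attention head or MLP) to image i's representation.
contribs = rng.normal(size=(n_images, n_components, d_vit))
final_rep = contribs.sum(axis=1)  # final representation = sum of contributions

# Synthetic stand-in for CLIP image embeddings of the same images.
true_map = rng.normal(size=(d_vit, d_clip))
clip_emb = final_rep @ true_map + 0.01 * rng.normal(size=(n_images, d_clip))

# (b) Fit a bias-free linear map W so that final_rep @ W approximates clip_emb.
W, *_ = np.linalg.lstsq(final_rep, clip_emb, rcond=None)

# Linearity means the mapped contributions still sum to the mapped representation.
mapped = contribs @ W  # shape: (n_images, n_components, d_clip)

# Synthetic stand-in for a normalised CLIP text embedding of a feature such as "color".
text_emb = rng.normal(size=d_clip)
text_emb /= np.linalg.norm(text_emb)

# Illustrative score (an assumption, not the paper's): how strongly each
# component's mapped contribution aligns with the text direction on average.
scores = np.abs(mapped @ text_emb).mean(axis=0)  # shape: (n_components,)
ranking = np.argsort(-scores)  # component indices, highest score first
```

Omitting the bias term is what lets the mapped contributions add up exactly to the mapped final representation; with a bias, it would have to be assigned to some component or tracked separately.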
