Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

Abstract

Given that interpretability and steerability are crucial to AI safety, Sparse Autoencoders (SAEs) have emerged as a tool to enhance them in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity at the neuron level in vision representations. To ensure that our evaluation aligns with human perception, we propose a benchmark derived from a large-scale user study. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons, with sparsity and wide latents being the most influential factors. Notably, we demonstrate that applying SAE interventions on CLIP's vision encoder directly steers multimodal LLM outputs (e.g., LLaVA), without any modifications to the underlying model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised tool for enhancing both interpretability and control of VLMs. Code is available at https://github.com/ExplainableML/sae-for-vlm.
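
To make the idea of an SAE intervention concrete, the following is a minimal sketch in pure Python and is not the authors' implementation. It assumes a generic ReLU sparse autoencoder with a latent layer wider than the input, randomly initialized and untrained, with decoder weights tied to the encoder for brevity. The names `TinySAE` and `steer` and all dimensions are hypothetical. Steering is illustrated by encoding an activation vector, overwriting one latent, and decoding the result.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class TinySAE:
    """Illustrative ReLU sparse autoencoder (hypothetical, untrained)."""

    def __init__(self, d_model, d_latent, seed=0):
        rng = random.Random(seed)
        # Encoder weights: d_latent rows of length d_model ("wide" when d_latent > d_model).
        self.W_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                      for _ in range(d_latent)]
        self.b_enc = [0.0] * d_latent
        # Assumption for this sketch: decoder is the transpose of the encoder.
        self.W_dec = [[self.W_enc[j][i] for j in range(d_latent)]
                      for i in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Subtract the decoder bias, project, then apply ReLU so latents are non-negative.
        centered = [xi - bi for xi, bi in zip(x, self.b_dec)]
        pre = [p + b for p, b in zip(matvec(self.W_enc, centered), self.b_enc)]
        return [max(0.0, p) for p in pre]

    def decode(self, z):
        # Reconstruct the activation from the latent code.
        return [r + b for r, b in zip(matvec(self.W_dec, z), self.b_dec)]


def steer(sae, x, latent_index, value):
    """Overwrite one latent with `value` and return the decoded activation."""
    z = sae.encode(x)
    z[latent_index] = value
    return sae.decode(z)


if __name__ == "__main__":
    sae = TinySAE(d_model=4, d_latent=16)
    activation = [0.5, -1.0, 0.25, 2.0]  # stand-in for a vision-encoder activation
    print(steer(sae, activation, latent_index=3, value=5.0))
```

In a real setting the SAE would first be trained with a reconstruction loss plus a sparsity constraint, and the steered activation would replace the original one inside the vision encoder before it reaches the language model. Both steps are omitted here.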
