Steering CLIP's vision transformer with sparse autoencoders

Abstract

While vision models are highly capable, their internal mechanisms remain poorly understood, a challenge that sparse autoencoders (SAEs) have helped address in language but that remains underexplored in vision. We address this gap by training SAEs on CLIP's vision transformer and uncover key differences between vision and language processing, including distinct sparsity patterns for SAEs trained across layers and token types. We then provide the first systematic analysis of the steerability of CLIP's vision transformer by introducing metrics to quantify how precisely SAE features can be steered to affect the model's output. We find that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model. Through targeted suppression of SAE features, we then demonstrate improved performance on three vision disentanglement tasks (CelebA, Waterbirds, and typographic attacks), finding optimal disentanglement in middle model layers, and achieving state-of-the-art performance on defense against typographic attacks.
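To make "targeted suppression of SAE features" concrete, the following is a minimal sketch of the general idea, not the paper's implementation. It assumes a plain ReLU sparse autoencoder; the weights are random stand-ins, the dimensions are arbitrary, and all function names (`sae_encode`, `sae_decode`, `suppress_features`) are hypothetical. Adding the reconstruction error back after editing is also an assumption made here, so that only the suppressed features' contribution changes.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: map activations to non-negative feature coefficients."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f, W_dec, b_dec):
    """Linear decoder: map feature coefficients back to activation space."""
    return f @ W_dec + b_dec

def suppress_features(x, W_enc, b_enc, W_dec, b_dec, feature_ids):
    """Zero the chosen SAE features and return the edited activations.

    The SAE reconstruction error is added back (an assumption of this
    sketch), so the edit removes only the suppressed features' contribution.
    """
    f = sae_encode(x, W_enc, b_enc)
    error = x - sae_decode(f, W_dec, b_dec)
    f_edit = f.copy()
    f_edit[..., feature_ids] = 0.0
    return sae_decode(f_edit, W_dec, b_dec) + error

# Toy example with random stand-in weights (not a trained SAE).
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32            # illustrative sizes only
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))  # stand-in for 4 token activations
ids = [3, 7]                       # hypothetical features to suppress
steered = suppress_features(x, W_enc, b_enc, W_dec, b_dec, ids)
```

In a real setting, `x` would be the activations at a chosen layer of CLIP's vision transformer, and the edited activations would be passed on through the remaining layers.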
