Disentangling Dense Embeddings with Sparse Autoencoders

Abstract

Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks. We present one of the first applications of SAEs to dense text embeddings from large language models, demonstrating their effectiveness in disentangling semantic concepts. By training SAEs on embeddings of over 420,000 scientific paper abstracts from computer science and astronomy, we show that the resulting sparse representations maintain semantic fidelity while offering interpretability. We analyse these learned features, exploring their behaviour across different model capacities and introducing a novel method for identifying "feature families" that represent related concepts at varying levels of abstraction. To demonstrate the practical utility of our approach, we show how these interpretable features can be used to precisely steer semantic search, allowing for fine-grained control over query semantics. This work bridges the gap between the semantic richness of dense embeddings and the interpretability of sparse representations. We open source our embeddings, trained sparse autoencoders, and interpreted features, as well as a web app for exploring them.
