SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

Abstract

Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across eight diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across eight recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. For instance, while Matryoshka SAEs slightly underperform on existing proxy metrics, they substantially outperform other architectures on feature disentanglement metrics; moreover, this advantage grows with SAE scale. By providing a standardized framework for measuring progress in SAE development, SAEBench enables researchers to study scaling trends and make nuanced comparisons between different SAE architectures and training methodologies. Our interactive interface enables researchers to flexibly visualize relationships between metrics across hundreds of open-source SAEs at: www.neuronpedia.org/sae-bench
