AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Abstract

Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best. On both evaluations, SAEs are not competitive. We introduce a novel weakly-supervised representational method (Rank-1 Representation Finetuning; ReFT-r1), which is competitive on both tasks while providing the interpretability advantages that prompting lacks. Along with AxBench, we train and publicly release SAE-scale feature dictionaries for ReFT-r1 and DiffMean.
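
For readers unfamiliar with difference-in-means, the following is a minimal sketch of the general idea, assuming hidden states are already available as plain lists of floats. It is not the AxBench implementation: the function names (`mean_vector`, `diff_mean_direction`, `concept_score`, `steer`), the scaling factor `alpha`, and the toy vectors are all hypothetical and chosen only for illustration. The direction is the mean of activations on concept-positive examples minus the mean on negative examples; a dot product with that direction can serve as a detection score, and adding a scaled copy of it to a hidden state is one simple way to steer.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / n for i in range(dim)]


def diff_mean_direction(positive: Sequence[Sequence[float]],
                        negative: Sequence[Sequence[float]]) -> Vector:
    """Mean of concept-positive activations minus mean of negative ones."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def concept_score(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """Detection score: projection (dot product) onto the direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def steer(hidden: Sequence[float], direction: Sequence[float],
          alpha: float) -> Vector:
    """Steering by addition: shift the hidden state along the direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional "activations" standing in for real model hidden states.
positive_examples = [[2.0, 0.0], [4.0, 0.0]]
negative_examples = [[0.0, 0.0], [0.0, 2.0]]

direction = diff_mean_direction(positive_examples, negative_examples)
score_on = concept_score([1.0, 0.0], direction)
score_off = concept_score([0.0, 1.0], direction)
steered = steer([0.0, 0.0], direction, alpha=0.5)
```

In this toy setup, a state aligned with the positive examples scores higher than one aligned with the negative examples. In practice the inputs would be activations collected from a chosen layer of the model, and the choice of layer and scaling factor is left open here.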
