Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

Abstract

Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keeps the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up the activations of the top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-mean, SDCV consistently improves steering success rates by 4-16% across six challenging concepts, while maintaining topic relevance.
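The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the ReLU encoder, the random toy SAE weights, the scoring rule (absolute gap between class-mean latent activations), and all names (`sdcv_diff_in_mean`, `scale`, `k`) are assumptions made for illustration. The sketch encodes positive and negative hidden states, keeps and rescales only the top-k most discriminative latents, reconstructs the hidden states, and then takes a difference-in-mean over the reconstructions.

```python
import numpy as np

def sdcv_diff_in_mean(h_pos, h_neg, W_enc, b_enc, W_dec, b_dec, k=8, scale=1.0):
    """Illustrative only: denoise via top-k SAE latents, then difference-in-mean."""
    def encode(h):
        # Assumed ReLU SAE encoder; real SAEs may use other activations.
        return np.maximum(h @ W_enc + b_enc, 0.0)

    z_pos, z_neg = encode(h_pos), encode(h_neg)

    # Assumed discriminativeness score: gap between class-mean activations.
    score = np.abs(z_pos.mean(axis=0) - z_neg.mean(axis=0))
    top_k = np.argsort(score)[-k:]

    # Zero out every latent except the top-k, which are multiplied by `scale`.
    mask = np.zeros(W_enc.shape[1])
    mask[top_k] = scale

    def decode(z):
        return (z * mask) @ W_dec + b_dec

    # Difference-in-mean on the denoised reconstructions.
    v = decode(z_pos).mean(axis=0) - decode(z_neg).mean(axis=0)
    norm = np.linalg.norm(v)
    return (v / norm if norm > 0 else v), top_k

# Toy data: random "SAE" weights and a synthetic concept direction (all hypothetical).
rng = np.random.default_rng(0)
d, m, n = 16, 64, 200
W_enc = rng.normal(size=(d, m)) / np.sqrt(d)
W_dec = rng.normal(size=(m, d)) / np.sqrt(m)
b_enc, b_dec = np.zeros(m), np.zeros(d)
concept = rng.normal(size=d)
h_neg = rng.normal(size=(n, d))
h_pos = rng.normal(size=(n, d)) + concept

vec, latents = sdcv_diff_in_mean(h_pos, h_neg, W_enc, b_enc, W_dec, b_dec, k=8)
print(vec.shape, sorted(latents.tolist()))
```

In practice the weights would come from a trained SAE for the layer being steered, and the resulting unit vector would be added to that layer's hidden states with a chosen steering strength. For the linear-probing variant, a probe would presumably be fit on the denoised reconstructions instead of taking the mean difference.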
