Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

  • 2025-07-29 21:40:42
  • Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du

Abstract

Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up activations of top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-mean, SDCV consistently improves steering success rates by 4-16% across six challenging concepts, while maintaining topic relevance.
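
The denoising idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name sdcv_diff_in_means, the ReLU encoder, the mean-activation-gap score used to rank latents, the choice to zero out all non-selected latents, and the scale factor are all assumptions made for illustration. The sketch encodes hidden states with an SAE, keeps and scales up the top-k latents that differ most between positive and negative samples, reconstructs the hidden states, and takes the difference in means.

```python
import numpy as np

def sdcv_diff_in_means(pos_h, neg_h, W_enc, b_enc, W_dec, b_dec, k=1, scale=2.0):
    """Hypothetical sketch of an SAE-denoised difference-in-means concept vector.

    pos_h, neg_h: (n, d) hidden states for positive / negative samples.
    W_enc (d, m), b_enc (m,), W_dec (m, d), b_dec (d,): SAE weights (assumed layout).
    """
    def encode(h):
        # Assumed ReLU SAE encoder.
        return np.maximum(h @ W_enc + b_enc, 0.0)

    z_pos, z_neg = encode(pos_h), encode(neg_h)

    # Assumed discriminativeness score: absolute gap in mean latent activation.
    gap = np.abs(z_pos.mean(axis=0) - z_neg.mean(axis=0))
    top = np.argsort(gap)[-k:]

    # Keep only the top-k latents, scaled up; all other latents are zeroed.
    mask = np.zeros(W_enc.shape[1])
    mask[top] = scale

    def denoise(z):
        # Reconstruct hidden states from the masked latents.
        return (z * mask) @ W_dec + b_dec

    vector = denoise(z_pos).mean(axis=0) - denoise(z_neg).mean(axis=0)
    return vector, top
```

With a toy identity SAE, a latent that fires only on positive samples dominates the resulting vector, while a weakly varying "noise" latent is dropped. A linear-probing variant would instead fit a classifier on the reconstructed hidden states.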

 
