Abstract
Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keeps the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up activations of the top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-mean, SDCV consistently improves steering success rates by 4-16% across six challenging concepts, while maintaining topic relevance.
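To make the idea concrete, the following is a minimal sketch of SAE-based denoising followed by a difference-in-mean concept vector. It is not the authors' implementation. The abstract does not specify several details, so the sketch assumes them: latents are ranked by the absolute gap in mean activation between positive and negative samples, unselected latents are zeroed, and the SAE is a toy ReLU encoder with a linear decoder. All function names and the `scale` parameter are hypothetical.

```python
# Illustrative sketch only: the ranking rule, the zeroing of unselected latents,
# and all names below are assumptions, not the paper's code.

def encode(h, W_enc):
    """Toy SAE encoder: ReLU(W_enc h), with one row of W_enc per latent."""
    return [max(0.0, sum(w * x for w, x in zip(row, h))) for row in W_enc]

def decode(z, W_dec):
    """Toy SAE decoder: sum over latents j of z[j] * W_dec[j]."""
    dim = len(W_dec[0])
    return [sum(z[j] * W_dec[j][i] for j in range(len(z))) for i in range(dim)]

def mean_vec(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def top_k_latents(pos_z, neg_z, k):
    """Rank latents by |mean positive activation - mean negative activation| (assumed rule)."""
    gap = [abs(p - n) for p, n in zip(mean_vec(pos_z), mean_vec(neg_z))]
    return sorted(range(len(gap)), key=lambda j: gap[j], reverse=True)[:k]

def denoise(h, W_enc, W_dec, keep, scale):
    """Reconstruct h from the selected latents only, scaled by `scale`."""
    z = encode(h, W_enc)
    z = [z[j] * scale if j in keep else 0.0 for j in range(len(z))]
    return decode(z, W_dec)

def sdcv_diff_in_means(pos, neg, W_enc, W_dec, k, scale=1.0):
    """Difference-in-mean concept vector computed on denoised representations."""
    pos_z = [encode(h, W_enc) for h in pos]
    neg_z = [encode(h, W_enc) for h in neg]
    keep = set(top_k_latents(pos_z, neg_z, k))
    pos_d = [denoise(h, W_enc, W_dec, keep, scale) for h in pos]
    neg_d = [denoise(h, W_enc, W_dec, keep, scale) for h in neg]
    vector = [a - b for a, b in zip(mean_vec(pos_d), mean_vec(neg_d))]
    return vector, sorted(keep)

# Toy data: a 3-dimensional hidden state and an identity SAE, so latents equal inputs.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pos = [[2.0, 1.0, 0.5], [2.2, 0.9, 0.4]]   # concept present: dimension 0 is high
neg = [[0.1, 1.0, 0.6], [0.0, 1.1, 0.3]]   # concept absent: dimension 0 is near zero
vector, kept = sdcv_diff_in_means(pos, neg, identity, identity, k=1)
```

In this toy setup only the first latent separates the two groups, so the resulting vector has no component along the other two dimensions. A plain difference-in-mean over the raw states would instead keep the small, uninformative gap in the second dimension.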