Dense SAE Latents Are Features, Not Bugs

Abstract

Sparse autoencoders (SAEs) are designed to extract interpretable features from language models by enforcing a sparsity constraint. Ideally, training an SAE would yield latents that are both sparse and semantically meaningful. However, many SAE latents activate frequently (i.e., are dense), raising concerns that they may be undesirable artifacts of the training procedure. In this work, we systematically investigate the geometry, function, and origin of dense latents and show that they are not only persistent but often reflect meaningful model representations. We first demonstrate that dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream, and that ablating their subspace suppresses the emergence of new dense features in retrained SAEs, suggesting that high-density features are an intrinsic property of the residual space. We then introduce a taxonomy of dense latents, identifying classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction. Finally, we analyze how these features evolve across layers, revealing a shift from structural features in early layers, to semantic features in mid layers, and finally to output-oriented signals in the last layers of the model. Our findings indicate that dense latents serve functional roles in language model computation and should not be dismissed as training noise.
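
To make the notions of a dense latent and an antipodal pair concrete, the following is a minimal sketch on synthetic data, not the procedure used in this work. It assumes that density means the fraction of tokens on which a latent is active, that a latent counts as dense above a 10% threshold, and that two latents are antipodal when the cosine similarity of their decoder directions is below -0.9. Both thresholds, the function names, and the toy arrays are illustrative assumptions.

```python
import numpy as np

def activation_density(acts):
    """Fraction of tokens on which each latent is active (> 0).

    acts: array of shape (n_tokens, n_latents) with SAE latent activations.
    """
    return (acts > 0).mean(axis=0)

def antipodal_pairs(decoder, latent_idx, threshold=-0.9):
    """Find latents whose decoder directions point in nearly opposite directions.

    decoder: array of shape (n_latents, d_model), one direction per latent.
    latent_idx: indices of the latents to compare (e.g. the dense ones).
    threshold: assumed cosine-similarity cutoff for calling a pair antipodal.
    """
    dirs = decoder[latent_idx]
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    cos = dirs @ dirs.T
    pairs = []
    for a in range(len(latent_idx)):
        for b in range(a + 1, len(latent_idx)):
            if cos[a, b] <= threshold:
                pairs.append((int(latent_idx[a]), int(latent_idx[b]), float(cos[a, b])))
    return pairs

# Synthetic stand-in for SAE activations and decoder weights (assumption).
rng = np.random.default_rng(0)
n_tokens, n_latents, d_model = 1000, 6, 16

acts = np.zeros((n_tokens, n_latents))
signal = rng.standard_normal(n_tokens)
acts[:, 0] = np.maximum(signal, 0)    # fires when the signal is positive
acts[:, 1] = np.maximum(-signal, 0)   # fires when the signal is negative
for k in range(2, n_latents):         # remaining latents fire on about 1% of tokens
    mask = rng.random(n_tokens) < 0.01
    acts[mask, k] = 1.0

decoder = rng.standard_normal((n_latents, d_model))
decoder[1] = -decoder[0]              # latents 0 and 1 share one axis with opposite signs

density = activation_density(acts)
dense_idx = np.flatnonzero(density > 0.1)   # assumed density threshold
pairs = antipodal_pairs(decoder, dense_idx)
print("dense latents:", dense_idx.tolist())
print("antipodal pairs:", pairs)
```

In this toy setup, latents 0 and 1 each fire on roughly half of the tokens and together encode the two signs of a single direction, which is the kind of structure the abstract describes for dense latents in the residual stream.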
