Decomposing The Dark Matter of Sparse Autoencoders

Abstract

Sparse autoencoders (SAEs) are a promising technique for decomposing language model activations into interpretable linear features. However, current SAEs fall short of completely explaining model performance, resulting in "dark matter": unexplained variance in activations. This work investigates dark matter as an object of study in its own right. Surprisingly, we find that much of SAE dark matter (about half of the error vector itself and >90% of its norm) can be linearly predicted from the initial activation vector. Additionally, we find that the scaling behavior of SAE error norms at a per-token level is remarkably predictable: larger SAEs mostly struggle to reconstruct the same contexts as smaller SAEs. We build on the linear representation hypothesis to propose models of activations that might lead to these observations, including postulating a new type of "introduced error"; these insights imply that the part of the SAE error vector that cannot be linearly predicted ("nonlinear" error) might be fundamentally different from the linearly predictable component. To validate this hypothesis, we empirically analyze nonlinear SAE error and show that 1) it contains fewer not-yet-learned features, 2) SAEs trained on it are quantitatively worse, 3) it helps predict SAE per-token scaling behavior, and 4) it is responsible for a proportional amount of the downstream increase in cross-entropy loss when SAE activations are inserted into the model. Finally, we examine two methods to reduce nonlinear SAE error at a fixed sparsity: inference-time gradient pursuit, which leads to a very slight decrease in nonlinear error, and linear transformations from earlier-layer SAE outputs, which leads to a larger reduction.
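
To make "linearly predicted from the initial activation vector" concrete, the following is a minimal sketch and not the authors' code. It assumes a plain least-squares fit with a bias term and in-sample R² as the measure of explained variance. The data is synthetic (built so that roughly half of the error is linear in the activations), and the names `fit_linear_map`, `r_squared`, `x`, and `sae_error` are hypothetical. In practice `x` would hold model activations and `sae_error` the difference between those activations and the SAE reconstructions.

```python
import numpy as np

def fit_linear_map(x, y):
    """Least-squares fit of y ~ [x, 1] @ w; returns in-sample predictions."""
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(x_aug, y, rcond=None)
    return x_aug @ w

def r_squared(y, y_hat):
    """Fraction of variance in y explained by y_hat, pooled over dimensions."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins (assumption): real inputs would come from a model and an SAE.
rng = np.random.default_rng(0)
n_tokens, d_model = 2000, 16
x = rng.normal(size=(n_tokens, d_model))                       # "activations"
linear_part = 0.5 * (x @ rng.normal(size=(d_model, d_model)))  # predictable part
nonlinear_part = 2.0 * rng.normal(size=(n_tokens, d_model))    # unpredictable part
sae_error = linear_part + nonlinear_part

# How much of the error vector itself is linearly predictable from x?
error_pred = fit_linear_map(x, sae_error)
r2_vector = r_squared(sae_error, error_pred)

# How much of the squared error norm is linearly predictable from x?
sq_norm = np.sum(sae_error ** 2, axis=1, keepdims=True)
r2_norm = r_squared(sq_norm, fit_linear_map(x, sq_norm))

# The residual of the fit plays the role of the "nonlinear" error.
nonlinear_error = sae_error - error_pred

print(f"R^2 for error vector: {r2_vector:.3f}")
print(f"R^2 for squared error norm: {r2_norm:.3f}")
```

The numbers printed here reflect only how the synthetic data was constructed and say nothing about real SAEs. A held-out split would give a less optimistic estimate than the in-sample fit used in this sketch.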
