Abstract
We propose Low-Rank Sparse Attention (Lorsa), a sparse replacement model of Transformer attention layers that disentangles the original Multi-Head Self-Attention (MHSA) into individually comprehensible components. Lorsa is designed to address the challenge of attention superposition in order to understand attention-mediated interaction between features at different token positions. We show that Lorsa heads find cleaner and finer-grained versions of previously discovered MHSA behaviors such as induction heads, successor heads, and attention sink behavior (i.e., heavily attending to the first token). Lorsa and Sparse Autoencoders (SAEs) are both sparse dictionary learning methods applied to different Transformer components, and they lead to consistent findings in many ways. For instance, we discover a comprehensive family of arithmetic-specific Lorsa heads, each corresponding to an atomic operation in Llama-3.1-8B. Automated interpretability analysis indicates that Lorsa achieves parity with SAEs in interpretability, while Lorsa exhibits superior circuit discovery properties, especially for features computed collectively by multiple MHSA heads. We also conduct extensive experiments on architectural design ablation, Lorsa scaling laws, and error analysis.