Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

Abstract

Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and able to robustly generalize to unseen inputs. Distributed Alignment Search (DAS) is a powerful gradient descent method grounded in a theory of causal abstraction that uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. In the present paper, we scale DAS significantly by replacing the remaining brute-force search steps with learned parameters -- an approach we call Boundless DAS. This enables us to efficiently search for interpretable causal structure in large language models while they follow instructions. We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward deeply understanding the inner workings of our largest and most widely deployed language models.
