Compact Proofs of Model Performance via Mechanistic Interpretability

Abstract

In this work, we propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-$K$ task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.
