Tracr: Compiled Transformers as a Laboratory for Interpretability

Abstract

Interpretability research aims to build tools for understanding machine learning (ML) models. However, such tools are inherently hard to evaluate because we do not have ground truth information about how ML models actually work. In this work, we propose to build transformer models manually as a testbed for interpretability research. We introduce Tracr, a "compiler" for translating human-readable programs into weights of a transformer model. Tracr takes code written in RASP, a domain-specific language (Weiss et al. 2021), and translates it into weights for a standard, decoder-only, GPT-like transformer architecture. We use Tracr to create a range of ground truth transformers that implement programs including computing token frequencies, sorting, and Dyck-n parenthesis checking, among others. To enable the broader research community to explore and use compiled models, we provide an open-source implementation of Tracr at https://github.com/deepmind/tracr.
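To make the token-frequency example concrete, the sketch below emulates two RASP-style primitives in plain Python. It is an illustration only and does not use the Tracr API or produce transformer weights. The function names `select` and `selector_width` follow the primitives described by Weiss et al. (2021); `token_frequencies` and the argument order are assumptions made for this example.

```python
from typing import Callable, List, Sequence


def select(
    keys: Sequence[str],
    queries: Sequence[str],
    predicate: Callable[[str, str], bool],
) -> List[List[bool]]:
    """Build a boolean selector: entry [q][k] is predicate(keys[k], queries[q]).

    In RASP terms this plays the role of an attention pattern
    (hypothetical plain-Python stand-in, not the Tracr implementation).
    """
    return [[predicate(k, q) for k in keys] for q in queries]


def selector_width(selector: List[List[bool]]) -> List[int]:
    """Count, for each query position, how many key positions are selected."""
    return [sum(row) for row in selector]


def token_frequencies(tokens: Sequence[str]) -> List[int]:
    """For each position, count how often its token occurs in the sequence."""
    same_token = select(tokens, tokens, lambda k, q: k == q)
    return selector_width(same_token)


if __name__ == "__main__":
    print(token_frequencies(list("hello")))  # [1, 1, 2, 2, 1]
```

Each position attends to every position holding the same token, and the width of that selection is the token's frequency. A compiler such as Tracr would express the same two steps as attention and MLP layers rather than Python loops.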
