Attention Mechanisms Don't Learn Additive Models: Rethinking Feature Importance for Transformers

Abstract

We address the critical challenge of applying feature attribution methods to the transformer architecture, which dominates current applications in natural language processing and beyond. Traditional attribution methods in explainable AI (XAI) explicitly or implicitly rely on linear or additive surrogate models to quantify the impact of input features on a model's output. In this work, we formally prove an alarming incompatibility: transformers are structurally incapable of representing the linear or additive surrogate models used for feature attribution, undermining the grounding of these conventional explanation methodologies. To address this discrepancy, we introduce the Softmax-Linked Additive Log Odds Model (SLALOM), a novel surrogate model specifically designed to align with the transformer framework. SLALOM demonstrates the capacity to deliver a range of insightful explanations on both synthetic and real-world datasets. We highlight SLALOM's unique efficiency-quality curve by showing that SLALOM can produce explanations with substantially higher fidelity than competing surrogate models or provide explanations of comparable quality at a fraction of their computational costs. We release code for SLALOM as an open-source project online at https://github.com/tleemann/slalom_explanations.
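
To make the contrast between an additive surrogate and a softmax-linked one concrete, the following is a minimal sketch and not the authors' implementation. It assumes, based on the model's name, that each token carries two scores: an importance score that enters a softmax over the sequence, and a value score that contributes to the output log odds. The function name `slalom_log_odds`, the dictionaries `importance` and `value`, and all numbers are hypothetical and chosen only for illustration.

```python
import math

def slalom_log_odds(tokens, importance, value):
    """Softmax-weighted average of per-token value scores (illustrative sketch).

    Assumption: `importance` and `value` map each token to a scalar score.
    """
    scores = [importance[t] for t in tokens]
    m = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w / total * value[t] for w, t in zip(weights, tokens))

def additive_log_odds(tokens, value):
    """Plain additive surrogate: sum of per-token contributions."""
    return sum(value[t] for t in tokens)

# Hypothetical scores, not taken from the paper.
importance = {"great": 2.0, "the": -1.0, "movie": 0.0}
value = {"great": 3.0, "the": 0.0, "movie": 0.5}

one = slalom_log_odds(["great"], importance, value)
two = slalom_log_odds(["great", "great"], importance, value)
additive_two = additive_log_odds(["great", "great"], value)
```

In this sketch, repeating a token leaves the softmax-weighted output unchanged (`one` and `two` are equal), whereas the additive surrogate doubles its output. This is the kind of behavior a purely additive model cannot reproduce.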
