On the Validity of Self-Attention as Explanation in Transformer Models

Abstract

Explainability of deep learning systems is a vital requirement for many applications. However, it is still an unsolved problem. Recent self-attention-based models for natural language processing, such as the Transformer or BERT, offer hope of greater explainability by providing attention maps that can be directly inspected. Nevertheless, by just looking at the attention maps one often overlooks that the attention is not over words but over hidden embeddings, which themselves can be mixed representations of multiple embeddings. We investigate to what extent the implicit assumption made in many recent papers - that hidden embeddings at all layers still correspond to the underlying words - is justified. We quantify how much embeddings are mixed based on a gradient-based attribution method and find that already after the first layer less than 50% of the embedding is attributed to the underlying word, declining thereafter to a median contribution of 7.5% in the last layer. While throughout the layers the underlying word remains the one contributing most to the embedding, we argue that attention visualizations are misleading and should be treated with care when explaining the underlying deep learning system.
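
To make the idea of measuring how much a hidden embedding is "mixed" concrete, the following is a minimal sketch, not the attribution method used in the paper, whose details are not given in the abstract. It assumes a toy single-head self-attention layer with identity query/key/value projections and a residual connection, and it scores the contribution of input token j to output embedding i by the Frobenius norm of the Jacobian block d out_i / d x_j, estimated with central finite differences and normalized so that each row sums to 1. The names `attention_layer`, `jacobian_block_norm`, and `contributions` are hypothetical and chosen for illustration only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(x):
    """Toy single-head self-attention with a residual connection.

    Assumption: identity query/key/value projections, so this is an
    illustrative stand-in for a Transformer layer, not a real one.
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        scores = [
            sum(x[i][k] * x[j][k] for k in range(d)) / math.sqrt(d)
            for j in range(n)
        ]
        weights = softmax(scores)
        mixed = [sum(weights[j] * x[j][k] for j in range(n)) for k in range(d)]
        out.append([x[i][k] + mixed[k] for k in range(d)])
    return out


def jacobian_block_norm(layer, x, i, j, eps=1e-5):
    """Frobenius norm of d out_i / d x_j, via central finite differences."""
    d = len(x[0])
    squared = 0.0
    for k in range(d):
        plus = [row[:] for row in x]
        minus = [row[:] for row in x]
        plus[j][k] += eps
        minus[j][k] -= eps
        out_plus = layer(plus)[i]
        out_minus = layer(minus)[i]
        for m in range(d):
            grad = (out_plus[m] - out_minus[m]) / (2 * eps)
            squared += grad * grad
    return math.sqrt(squared)


def contributions(layer, x):
    """Row i holds the normalized contribution of each input token to out_i."""
    n = len(x)
    table = []
    for i in range(n):
        norms = [jacobian_block_norm(layer, x, i, j) for j in range(n)]
        total = sum(norms)
        table.append([v / total for v in norms])
    return table


# Made-up 2-dimensional embeddings for three tokens.
embeddings = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]]
for i, row in enumerate(contributions(attention_layer, embeddings)):
    print(i, [round(v, 3) for v in row])
```

In this sketch the diagonal entry `table[i][i]` plays the role of the share of an embedding attributed to its own underlying token, the kind of quantity the percentages above refer to. The off-diagonal entries show how much of that embedding depends on other tokens.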
