Abstract
We propose a probabilistic interpretation of transformers as unrolled inference steps assuming a probabilistic Laplacian Eigenmaps model from the ProbDR framework. Our derivation shows that at initialisation, transformers perform "linear" dimensionality reduction. We also show that within the transformer block, a graph Laplacian term arises from our arguments, rather than an attention matrix (which we interpret as an adjacency matrix). We demonstrate that simply subtracting the identity from the attention matrix (and thereby taking a graph diffusion step) improves validation performance on a language model and a simple vision transformer.
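The modification described in the abstract can be illustrated with a minimal NumPy sketch. It assumes single-head softmax attention with no masking; the function names, weight matrices, and shapes are hypothetical and are not taken from the paper's code. Reading the row-stochastic attention matrix A as an adjacency matrix, I - A is a random-walk graph Laplacian, so multiplying the values by A - I corresponds to one diffusion step. Where exactly the subtraction sits relative to the residual connection is not specified in the abstract and is left out here.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_matrix(x, w_q, w_k):
    """Row-stochastic attention matrix A, read here as a graph adjacency matrix."""
    q = x @ w_q
    k = x @ w_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1)


def diffusion_attention(x, w_q, w_k, w_v):
    """Apply (A - I) to the values instead of A.

    Since I - A is a random-walk Laplacian, A - I is its negative,
    and this product is one graph diffusion step on the values.
    """
    a = attention_matrix(x, w_q, w_k)
    neg_laplacian = a - np.eye(a.shape[0])
    return neg_laplacian @ (x @ w_v)


# Hypothetical toy input: 5 tokens with 8 features each.
rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

a = attention_matrix(x, w_q, w_k)
out = diffusion_attention(x, w_q, w_k, w_v)
```

Each row of A sums to one, so each row of A - I sums to zero: a value matrix whose rows are all identical is mapped to zero, which is the usual signature of a Laplacian-type operator.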