Attention-Only Transformers and Implementing MLPs with Attention Heads

Abstract

The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads. We also prove that attention heads can perform the components of an MLP (linear transformations and activation functions) separately. Finally, we prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
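
To give some intuition for why SiLU fits naturally into an attention head, the sketch below checks a simple identity numerically. It is an illustration under stated assumptions, not necessarily the construction used in the paper: we assume a masking pattern in which a token attends only to itself and to one reference token, with scalar (dimension-1) scores and values. The function names `silu` and `two_token_attention_head` are hypothetical. If the token's own score and value are both x and the reference token's score and value are both 0, the softmax weight on the token itself is sigmoid(x), so the head outputs x * sigmoid(x) = SiLU(x).

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def two_token_attention_head(x):
    """Toy scalar attention head (hypothetical setup for illustration).

    Assumes the mask lets the current token attend only to itself and to
    one reference token whose score and value are both 0.
    """
    scores = [x, 0.0]  # pre-softmax attention scores
    values = [x, 0.0]  # scalar values carried by the two tokens

    # Numerically stable softmax over the two scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention output: weighted sum of the values.
    return sum(w * v for w, v in zip(weights, values))


if __name__ == "__main__":
    for x in [-3.0, -1.0, 0.0, 0.5, 2.0]:
        print(x, two_token_attention_head(x), silu(x))
```

Running the script prints matching values in the last two columns for each input. A full construction would also need to produce the pre-activation x from the residual stream and write the result back, which this toy example leaves out.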
