Replacing softmax with ReLU in Vision Transformers

Abstract

Previous research observed accuracy degradation when replacing the attention softmax with a point-wise activation such as ReLU. In the context of vision transformers, we find that this degradation is mitigated when dividing by sequence length. Our experiments training small to large vision transformers on ImageNet-21k indicate that ReLU-attention can approach or match the performance of softmax-attention in terms of scaling behavior as a function of compute.
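The change described in the abstract can be sketched in a few lines. The following is a minimal, single-head illustration in plain Python, not the authors' implementation: the function name `relu_attention` is hypothetical, and the 1/sqrt(d) scaling of the scores is an assumption carried over from standard scaled dot-product attention, since the abstract only states that the ReLU output is divided by sequence length.

```python
import math

def relu_attention(q, k, v):
    """Single-head attention with ReLU in place of softmax (illustrative sketch).

    q, k: lists of L vectors of dimension d; v: list of L value vectors.
    """
    seq_len = len(k)   # sequence length L used for the normalization
    d = len(q[0])      # query/key dimension
    out = []
    for qi in q:
        # Scaled dot-product scores (the 1/sqrt(d) factor is an assumption).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # ReLU replaces softmax; dividing by L replaces softmax normalization.
        weights = [max(0.0, s) / seq_len for s in scores]
        # Weighted sum of value vectors, one output channel at a time.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out
```

Unlike softmax weights, these weights are not constrained to sum to one, and any key with a negative score contributes exactly zero to the output.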
