Abstract
Previous research observed accuracy degradation when replacing the attention softmax with a point-wise activation such as ReLU. In the context of vision transformers, we find that this degradation is mitigated when dividing by sequence length. Our experiments training small to large vision transformers on ImageNet-21k indicate that ReLU-attention can approach or match the performance of softmax-attention in terms of scaling behavior as a function of compute.
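To make the mechanism concrete, the following is a minimal sketch of single-head attention in which the softmax is replaced by a point-wise ReLU and the result is divided by the sequence length. The function name `relu_attention`, the list-of-rows layout, and the 1/sqrt(d) score scaling are assumptions made for illustration; the abstract specifies only the ReLU replacement and the division by sequence length.

```python
import math


def relu_attention(q, k, v):
    """Single-head attention with ReLU / seq_len in place of softmax.

    q, k: lists of L rows with d floats each; v: list of L rows with d_v floats.
    Hypothetical helper for illustration, not the authors' implementation.
    """
    seq_len = len(k)
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)  # assumed: standard scaled dot-product scores
    out = []
    for qi in q:
        # Scores of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        # Point-wise ReLU, then divide by sequence length (no softmax normalization).
        weights = [max(0.0, s) / seq_len for s in scores]
        # Weighted sum of value rows.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    k = [[1.0, 0.0], [0.0, -1.0]]
    v = [[2.0, 0.0], [0.0, 4.0]]
    print(relu_attention(q, k, v))
```

Unlike softmax weights, these weights do not sum to one for each query: a query whose scores are all non-positive receives all-zero weights and therefore a zero output row.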