Replacing softmax with ReLU in Vision Transformers

  • 2023-09-15 18:43:40
  • Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith

Abstract

Previous research observed accuracy degradation when replacing the attention softmax with a point-wise activation such as ReLU. In the context of vision transformers, we find that this degradation is mitigated when dividing by sequence length. Our experiments training small to large vision transformers on ImageNet-21k indicate that ReLU-attention can approach or match the performance of softmax-attention in terms of scaling behavior as a function of compute.
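As a rough illustration of the idea in the abstract, the sketch below computes single-head attention with the softmax replaced by a point-wise ReLU whose output is divided by the sequence length. It is a minimal, dependency-free sketch and not the authors' implementation: the function name `relu_attention`, the list-of-lists representation, and the conventional 1/sqrt(d) score scaling are assumptions made here for illustration.

```python
import math


def relu_attention(q, k, v):
    """Single-head attention with softmax replaced by ReLU / sequence length.

    q, k: lists of vectors of dimension d (one vector per token).
    v:    list of value vectors, one per key token.
    The name and data layout are hypothetical, chosen for this sketch.
    """
    seq_len = len(k)
    d = len(q[0])
    out = []
    for qi in q:
        # Scaled dot-product scores; the 1/sqrt(d) factor is the usual
        # attention convention and is assumed here, not taken from the abstract.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # Point-wise ReLU in place of softmax, then divide by sequence length.
        weights = [max(s, 0.0) / seq_len for s in scores]
        # Weighted sum of the value vectors.
        out.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [-1.0, 0.0]]
    v = [[2.0, 4.0], [10.0, 10.0]]
    print(relu_attention(q, k, v))
```

Unlike softmax, the resulting weights are not normalized to sum to one, and a key with a negative score contributes nothing at all; the division by sequence length only keeps the overall magnitude of the weighted sum from growing with the number of tokens.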

 
