Visformer: The Vision-friendly Transformer

  • 2021-04-26 13:13:03
  • Zhengsu Chen, Lingxi Xie, Jianwei Niu, Xuefeng Liu, Longhui Wei, Qi Tian
The past year has witnessed the rapid development of applying the Transformer module to vision problems. While some researchers have demonstrated that Transformer-based models have a strong ability to fit data, there is still a growing body of evidence showing that these models suffer from over-fitting, especially when the training data is limited. This paper offers an empirical study by performing step-by-step operations to gradually transition a Transformer-based model to a convolution-based model. The results we obtain during the transition process offer useful insights for improving visual recognition. Based on these observations, we propose a new architecture named Visformer, which is abbreviated from the "Vision-friendly Transformer". With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy, and the advantage becomes more significant when the model complexity is lower or the training set is smaller. The code is available at


