The past year has witnessed rapid progress in applying the Transformer module to vision problems. While some researchers have demonstrated that Transformer-based models enjoy a favorable ability to fit data, there is still a growing body of evidence showing that these models suffer from over-fitting, especially when the training data is limited. This paper offers an empirical study that performs step-by-step operations to gradually transition a Transformer-based model into a convolution-based model. The results we obtain during the transition process deliver useful messages for improving visual recognition. Based on these observations, we propose a new architecture named Visformer, which is abbreviated from the `Vision-friendly Transformer'. With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy, and the advantage becomes more significant when the model complexity is lower or the training set is smaller. The code is available at https://github.com/danczs/Visformer.