Abstract
Vision transformers (ViTs) have emerged as an alternative design paradigm to convolutional neural networks (CNNs). However, training ViTs is much harder than training CNNs, as it is sensitive to training settings such as the learning rate, optimizer, and number of warmup epochs. The reasons for this training difficulty are empirically analysed in~\cite{xiao2021early}, whose authors conjecture that the issue lies with the \textit{patchify-stem} of ViT models and propose that early convolutions help transformers see better. In this paper, we further investigate this problem and extend the above conclusion: early convolutions alone do not help stabilize training; rather, the scaled ReLU operation in the \textit{convolutional stem} (\textit{conv-stem}) is what matters. We verify, both theoretically and empirically, that scaled ReLU in the \textit{conv-stem} not only improves training stability but also increases the diversity of patch tokens, thus boosting peak performance by a large margin while adding few parameters and FLOPs. In addition, extensive experiments demonstrate that previous ViTs are far from being well trained, further showing that ViTs have great potential to be a better substitute for CNNs.