Improve Vision Transformers Training by Suppressing Over-smoothing

Abstract

Introducing the transformer structure into computer vision tasks holds the promise of yielding a better speed-accuracy trade-off than traditional convolutional networks. However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results. As a result, recent works propose modifying transformer structures by incorporating convolutional layers to improve performance on vision tasks. This work investigates how to stabilize the training of vision transformers without special structure modification. We observe that the instability of transformer training on vision tasks can be attributed to the over-smoothing problem: the self-attention layers tend to map different patches of the input image to similar latent representations, which causes loss of information and degraded performance, especially when the number of layers is large. We then propose a number of techniques to alleviate this problem, including additional loss functions that encourage diversity, prevent loss of information, and discriminate between different patches through an additional patch classification loss for CutMix. We show that our proposed techniques stabilize training and allow us to train wider and deeper vision transformers, achieving 85.0% top-1 accuracy on the ImageNet validation set without introducing extra teachers or additional convolution layers. Our code will be made publicly available at https://github.com/ChengyueGongR/PatchVisionTransformer.
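
To make the notion of over-smoothing concrete, the following is a minimal sketch of one way to quantify how similar the patch representations of a single image are: the average cosine similarity over all distinct pairs of patch tokens. The abstract does not define a metric, so the choice of cosine similarity, the function name `mean_pairwise_cosine`, and the toy "smoothing" step are assumptions made for illustration and are not taken from the authors' code.

```python
import numpy as np


def mean_pairwise_cosine(tokens: np.ndarray, eps: float = 1e-8) -> float:
    """Average cosine similarity over all distinct pairs of patch tokens.

    tokens: array of shape (num_patches, dim) holding the patch
    representations of one image at some layer (hypothetical input).
    Values near 1 mean the patches have collapsed to similar vectors.
    """
    n = tokens.shape[0]
    if n < 2:
        raise ValueError("need at least two patch tokens")
    # Normalize each token to unit length, guarding against zero vectors.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.maximum(norms, eps)
    # Full similarity matrix; the diagonal holds self-similarities.
    sim = unit @ unit.T
    off_diag_sum = sim.sum() - np.trace(sim)
    return float(off_diag_sum / (n * (n - 1)))


# Toy illustration (assumption): pull every token most of the way toward
# the mean token, mimicking what strongly averaging attention would do.
rng = np.random.default_rng(0)
raw_tokens = rng.normal(size=(16, 64))
smoothed_tokens = 0.1 * raw_tokens + 0.9 * raw_tokens.mean(axis=0)

raw_similarity = mean_pairwise_cosine(raw_tokens)
smoothed_similarity = mean_pairwise_cosine(smoothed_tokens)
print(f"raw: {raw_similarity:.3f}  smoothed: {smoothed_similarity:.3f}")
```

Tracking a quantity like this layer by layer is one plausible way to observe the collapse the abstract describes; a diversity-encouraging loss could, under the same assumption, penalize large values of it.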
