When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations

Abstract

Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower the models with massive data, such as large-scale pretraining and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rate). Hence, this paper investigates ViTs and MLP-Mixers through the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference. Visualization and Hessian analysis reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with the simple Inception-style preprocessing). We show that the improved smoothness is attributable to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations. They also possess more perceptive attention maps.
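To make the idea of a sharpness-aware update concrete, here is a minimal sketch. It assumes the optimizer follows the two-step pattern of sharpness-aware minimization (SAM): first move the weights a small distance rho along the normalized gradient toward higher loss, then apply the gradient measured at that perturbed point to the original weights. The toy quadratic loss, the function names, and the values of lr and rho are illustrative assumptions and are not taken from the paper.

```python
import math


def loss(w):
    # Hypothetical toy loss: one sharp direction (curvature 10)
    # and one flat direction (curvature 0.1).
    return 0.5 * (10.0 * w[0] ** 2 + 0.1 * w[1] ** 2)


def grad(w):
    # Analytic gradient of the toy loss above.
    return [10.0 * w[0], 0.1 * w[1]]


def sam_step(w, lr=0.05, rho=0.05):
    """One sharpness-aware step (sketch; lr and rho are assumed values)."""
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # Zero gradient: no ascent direction is defined, so leave w unchanged.
        return list(w)
    # Step 1: perturb the weights by rho along the normalized gradient.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Step 2: descend from the ORIGINAL weights using the gradient
    # evaluated at the perturbed weights.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


if __name__ == "__main__":
    w = [1.0, 1.0]
    for _ in range(100):
        w = sam_step(w)
    print("final weights:", w, "final loss:", loss(w))
```

Compared with plain gradient descent, the only change is that the gradient is evaluated at the perturbed weights, which penalizes directions in which the loss rises quickly. In a real training loop this costs two gradient computations per step.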
