Understanding Robustness of Transformers for Image Classification

Abstract

Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures like Vision Transformer (ViT) have matched or even surpassed ResNets for image classification. However, details of the Transformer architecture -- such as the use of non-overlapping patches -- lead one to wonder whether these networks are as robust. In this paper, we perform an extensive study of a variety of different measures of robustness of ViT models and compare the findings to ResNet baselines. We investigate robustness to input perturbations as well as robustness to model perturbations. We find that when pre-trained with a sufficient amount of data, ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations. We also find that Transformers are robust to the removal of almost any single layer, and that while activations from later layers are highly correlated with each other, they nevertheless play an important role in classification.
