Pyramid Adversarial Training Improves ViT Performance

Abstract

Aggressive data augmentation is a key component of the strong generalization capabilities of Vision Transformer (ViT). One such data augmentation technique is adversarial training; however, many prior works have shown that this often results in poor clean accuracy. In this work, we present Pyramid Adversarial Training, a simple and effective technique to improve ViT's overall performance. We pair it with a "matched" Dropout and stochastic depth regularization, which adopts the same Dropout and stochastic depth configuration for the clean and adversarial samples. Similar to the improvements on CNNs by AdvProp (not directly applicable to ViT), our Pyramid Adversarial Training breaks the trade-off between in-distribution accuracy and out-of-distribution robustness for ViT and related architectures. It leads to $1.82\%$ absolute improvement on ImageNet clean accuracy for the ViT-B model when trained only on ImageNet-1K data, while simultaneously boosting performance on $7$ ImageNet robustness metrics, by absolute numbers ranging from $1.76\%$ to $11.45\%$. We set a new state-of-the-art for ImageNet-C (41.4 mCE), ImageNet-R ($53.92\%$), and ImageNet-Sketch ($41.04\%$) without extra data, using only the ViT-B/16 backbone and our Pyramid Adversarial Training. Our code will be publicly available upon acceptance.
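As a rough illustration of the "matched" regularization described in the abstract, the sketch below runs a clean sample and its perturbed counterpart through the same toy residual block using one shared configuration object. Everything here is an assumption made for illustration and is not taken from the authors' code: the names `RegConfig`, `dropout`, `toy_block`, and `forward_pair` are hypothetical, the block is a stand-in for a transformer layer, and the perturbation is a fixed offset rather than a real adversarial attack. The abstract states only that the configuration is shared, so the sketch shares the rates and makes no claim about whether the random masks are shared.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RegConfig:
    """Hypothetical container for the regularization rates (both must be < 1)."""
    dropout_rate: float
    stochastic_depth_rate: float


def dropout(vec, rate, rng):
    """Inverted dropout on one feature vector: zero entries, rescale the rest."""
    if rate == 0.0:
        return list(vec)
    scale = 1.0 / (1.0 - rate)
    return [v * scale if rng.random() >= rate else 0.0 for v in vec]


def toy_block(vec, cfg, rng):
    """Toy residual block: x + stochastic_depth(dropout(tanh(x)))."""
    branch = dropout([math.tanh(v) for v in vec], cfg.dropout_rate, rng)
    if cfg.stochastic_depth_rate > 0.0:
        if rng.random() < cfg.stochastic_depth_rate:
            # Stochastic depth: skip the whole residual branch for this sample.
            branch = [0.0] * len(vec)
        else:
            scale = 1.0 / (1.0 - cfg.stochastic_depth_rate)
            branch = [b * scale for b in branch]
    return [v + b for v, b in zip(vec, branch)]


def forward_pair(clean, adv, cfg, rng):
    """Apply the SAME regularization config to the clean and adversarial inputs."""
    return toy_block(clean, cfg, rng), toy_block(adv, cfg, rng)


cfg = RegConfig(dropout_rate=0.1, stochastic_depth_rate=0.1)
rng = random.Random(0)
clean = [0.5, -1.0, 2.0]
adv = [c + 0.01 for c in clean]  # stand-in perturbation, not a real attack
out_clean, out_adv = forward_pair(clean, adv, cfg, rng)
```

An unmatched setup would, by contrast, pass different `RegConfig` values to the two `toy_block` calls, for example disabling Dropout on one branch only.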
