Confidence-Calibrated Adversarial Training: Towards Robust Models Generalizing Beyond the Attack Used During Training

Abstract

Adversarial training is the standard approach to training models that are robust against adversarial examples. However, especially for complex datasets, adversarial training incurs a significant loss in accuracy and is known to generalize poorly to stronger attacks, e.g., larger perturbations or other threat models. In this paper, we introduce confidence-calibrated adversarial training (CCAT), where the key idea is to enforce that the confidence on adversarial examples decays with their distance to the attacked examples. We show that CCAT better preserves the accuracy of normal training, while robustness against adversarial examples is achieved via confidence thresholding. Most importantly, in strong contrast to adversarial training, the robustness of CCAT generalizes to larger perturbations and other threat models not encountered during training. We also discuss our extensive work to design strong adaptive attacks against CCAT and standard adversarial training, which is of independent interest. We present experimental results on MNIST, SVHN and Cifar10.
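
The two ideas named in the abstract, a training target whose confidence decays with distance from the attacked example and rejection of low-confidence predictions at test time, can be illustrated with a minimal sketch. The abstract does not state the decay rule, so everything specific here is an assumption made for illustration: the L-infinity distance, the power-law weight with exponent `rho`, the interpolation towards the uniform distribution, and the function names `decayed_target` and `predict_or_reject`.

```python
def decayed_target(label, num_classes, delta, epsilon, rho=1.0):
    """Blend a one-hot target with the uniform distribution.

    Assumption (not from the abstract): the blend weight follows a power law
    in the L-infinity norm of the perturbation `delta`, relative to `epsilon`.
    """
    # L-infinity size of the perturbation (assumed distance measure).
    norm = max((abs(d) for d in delta), default=0.0)
    # Weight is 1 on the unperturbed example and reaches 0 at distance epsilon.
    weight = (1.0 - min(1.0, norm / epsilon)) ** rho
    uniform = 1.0 / num_classes
    return [
        weight * (1.0 if k == label else 0.0) + (1.0 - weight) * uniform
        for k in range(num_classes)
    ]


def predict_or_reject(probabilities, threshold):
    """Return the predicted class index, or None if confidence is too low."""
    confidence = max(probabilities)
    if confidence < threshold:
        return None  # reject: treated as a detected adversarial example
    return probabilities.index(confidence)


if __name__ == "__main__":
    # Clean example: the target stays one-hot.
    print(decayed_target(0, 4, [0.0, 0.0], epsilon=0.5))
    # Perturbation at the full budget: the target becomes uniform.
    print(decayed_target(0, 4, [0.5, -0.1], epsilon=0.5))
    # A uniform prediction falls below the threshold and is rejected.
    print(predict_or_reject([0.25, 0.25, 0.25, 0.25], threshold=0.5))
```

Under these assumptions, a model trained towards such targets would output near-uniform predictions far from the training data, so a single confidence threshold can reject perturbations larger than those seen during training.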
