DARCCC: Detecting Adversaries by Reconstruction from Class Conditional Capsules

Abstract

We present a simple technique that allows capsule models to detect adversarial images. In addition to being trained to classify images, the capsule model is trained to reconstruct the images from the pose parameters and identity of the correct top-level capsule. Adversarial images do not look like a typical member of the predicted class and they have much larger reconstruction errors when the reconstruction is produced from the top-level capsule for that class. We show that setting a threshold on the $l_2$ distance between the input image and its reconstruction from the winning capsule is very effective at detecting adversarial images for three different datasets. The same technique works quite well for CNNs that have been trained to reconstruct the image from all or part of the last hidden layer before the softmax. We then explore a stronger, white-box attack that takes the reconstruction error into account. This attack is able to fool our detection technique, but in order to make the model change its prediction to another class, the attack must typically make the "adversarial" image resemble images of the other class.
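The thresholding step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the reconstructions from the winning capsule have already been computed by some trained model, which is not shown here. The function names, and the rule of picking the threshold as a quantile of distances on clean validation images, are hypothetical choices made for illustration; the abstract does not state how the threshold is chosen.

```python
import numpy as np


def l2_reconstruction_distance(images, reconstructions):
    """Per-example l2 distance between inputs and their reconstructions.

    Both arrays have shape (batch, ...); trailing dimensions are flattened.
    """
    diff = images.reshape(len(images), -1) - reconstructions.reshape(len(reconstructions), -1)
    return np.linalg.norm(diff, axis=1)


def choose_threshold(clean_distances, false_positive_rate=0.05):
    # ASSUMPTION (not from the abstract): set the threshold so that roughly
    # `false_positive_rate` of clean validation images would be flagged.
    return float(np.quantile(clean_distances, 1.0 - false_positive_rate))


def flag_adversarial(distances, threshold):
    # An input is flagged when its reconstruction error exceeds the threshold.
    return distances > threshold
```

In use, one would compute distances on held-out clean images, pass them to `choose_threshold`, and then apply `flag_adversarial` to the distances of incoming inputs. The false positive rate is a tunable trade-off rather than a value taken from the paper.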
