Adversarial Doodles: Interpretable and Human-drawable Attacks Provide Describable Insights

Abstract

DNN-based image classification models are susceptible to adversarial attacks. Most previous adversarial attacks do not focus on the interpretability of the generated adversarial examples, so we cannot gain insights into the mechanism of the target classifier from them. We therefore propose Adversarial Doodles, which have interpretable shapes. We optimize black Bézier curves to fool the target classifier by overlaying them onto the input image. By introducing random perspective transformations and regularizing the doodled area, we obtain compact attacks that cause misclassification even when humans replicate them by hand. Adversarial doodles provide describable and intriguing insights into the relationship between our attacks and the classifier's output. Using adversarial doodles, we discover biases inherent in the target classifier, such as "We add two strokes on its head, a triangle onto its body, and two lines inside the triangle on a bird image. Then, the classifier misclassifies the image as a butterfly."
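
To make the overlay step concrete, the following is a minimal sketch of drawing one black cubic Bézier curve onto a grayscale image and measuring how much of the image it covers. It is an illustration only, not the paper's implementation: the abstract does not specify the curve degree, the renderer, or the form of the area regularizer, and the optimization loop and random perspective transformation are omitted. The function names (`cubic_bezier`, `overlay_doodle`, `doodled_area_ratio`), the list-of-lists image format, and the sample count are all assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Image = List[List[float]]  # grayscale, 1.0 = white, 0.0 = black (assumed format)


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    s = 1.0 - t
    x = s**3 * p0[0] + 3 * s**2 * t * p1[0] + 3 * s * t**2 * p2[0] + t**3 * p3[0]
    y = s**3 * p0[1] + 3 * s**2 * t * p1[1] + 3 * s * t**2 * p2[1] + t**3 * p3[1]
    return (x, y)


def overlay_doodle(image: Image, control_points: Tuple[Point, Point, Point, Point],
                   samples: int = 200) -> Image:
    """Return a copy of the image with the curve drawn as black pixels."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # leave the input image untouched
    for i in range(samples + 1):
        x, y = cubic_bezier(*control_points, i / samples)
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width:
            out[row][col] = 0.0
    return out


def doodled_area_ratio(original: Image, doodled: Image) -> float:
    """Fraction of pixels changed by the doodle (a stand-in for an area penalty)."""
    changed = sum(
        1
        for row_a, row_b in zip(original, doodled)
        for a, b in zip(row_a, row_b)
        if a != b
    )
    return changed / (len(original) * len(original[0]))


# Usage: one stroke on a blank 32x32 white image.
blank = [[1.0] * 32 for _ in range(32)]
stroke = ((2.0, 2.0), (10.0, 30.0), (22.0, 0.0), (29.0, 29.0))
doodled = overlay_doodle(blank, stroke)
ratio = doodled_area_ratio(blank, doodled)
```

In an attack of the kind described above, the control points would be the variables being optimized, and a quantity like `ratio` would be penalized to keep the doodle compact. This sketch uses hard pixel writes, so it is not differentiable; a gradient-based optimizer would need a soft rasterizer instead.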
