Using Honeypots to Catch Adversarial Attacks on Neural Networks

  • 2020-03-25 14:43:21
  • Shawn Shan, Emily Wenger, Bolun Wang, Bo Li, Haitao Zheng, Ben Y. Zhao

Abstract

Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks. Numerous efforts either try to patch weaknesses in trained models, or try to make it difficult or costly to compute adversarial examples that exploit them. In our work, we explore a new "honeypot" approach to protect DNN models. We intentionally inject trapdoors, honeypot weaknesses in the classification manifold that attract attackers searching for adversarial examples. Attackers' optimization algorithms gravitate towards trapdoors, leading them to produce attacks similar to trapdoors in the feature space. Our defense then identifies attacks by comparing neuron activation signatures of inputs to those of trapdoors. In this paper, we introduce trapdoors and describe an implementation of a trapdoor-enabled defense. First, we analytically prove that trapdoors shape the computation of adversarial attacks so that attack inputs will have feature representations very similar to those of trapdoors. Second, we experimentally show that trapdoor-protected models can detect, with high accuracy, adversarial examples generated by state-of-the-art attacks (Projected Gradient Descent, optimization-based CW, Elastic Net, BPDA), with negligible impact on normal classification. These results generalize across classification domains, including image, facial, and traffic-sign recognition. We also validate trapdoors' robustness against strong adaptive attacks (countermeasures), including those that can identify and unlearn trapdoors.
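
The detection step described above, comparing an input's neuron activation signature to that of a trapdoor, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the abstract: the use of cosine similarity as the comparison, the averaging of activations into a signature, the fixed threshold, the toy activation vectors, and the hypothetical helper names `trapdoor_signature` and `is_flagged`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def trapdoor_signature(activations):
    """Assumed signature: element-wise mean of the activation vectors
    recorded for inputs that carry the trapdoor."""
    n = len(activations)
    return [sum(column) / n for column in zip(*activations)]


def is_flagged(activation, signature, threshold):
    """Flag an input whose activations are close to the trapdoor signature."""
    return cosine_similarity(activation, signature) >= threshold


# Toy activation vectors (hypothetical values, not experimental data).
trapdoor_acts = [[0.9, 0.1, 0.8], [1.1, 0.0, 1.0]]
signature = trapdoor_signature(trapdoor_acts)

THRESHOLD = 0.9  # assumed; in practice it would be calibrated on benign inputs

benign = [0.1, 1.0, 0.0]     # activations unlike the trapdoor signature
suspect = [1.0, 0.1, 0.85]   # activations close to the trapdoor signature

print("benign flagged:", is_flagged(benign, signature, THRESHOLD))
print("suspect flagged:", is_flagged(suspect, signature, THRESHOLD))
```

In this sketch the threshold trades off missed attacks against false alarms on normal inputs, so it would need to be chosen from the similarity scores of known-benign inputs rather than fixed by hand.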

 
