How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods

  • 2019-11-06 17:52:20
  • Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, Himabindu Lakkaraju
  • 58

Abstract

As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanation techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution still remain biased, but the post hoc explanations of the scaffolded classifier look innocuous. Using extensive evaluation with multiple real-world datasets (including COMPAS), we demonstrate how extremely biased (racist) classifiers crafted by our framework can easily fool popular explanation techniques such as LIME and SHAP into generating innocuous explanations which do not reflect the underlying biases.
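The abstract does not spell out how the scaffold works, so the following is only a minimal sketch of one way to picture it, assuming the scaffold routes each input through an out-of-distribution check: inputs that look like real data go to the biased classifier, while inputs that look like the synthetic perturbations an explainer generates go to an innocuous one. All names here (`biased_model`, `innocuous_model`, `looks_in_distribution`, `scaffolded_model`), the toy binary data, and the distance-based detector are hypothetical stand-ins for illustration and are not the authors' code.

```python
import math
import random

random.seed(0)

# Toy "real" data: three binary features. Index 0 plays the role of a
# sensitive attribute and index 1 an unrelated feature (both hypothetical).
real_data = [[random.randint(0, 1) for _ in range(3)] for _ in range(50)]

def biased_model(x):
    """Decides purely on the sensitive feature (the behaviour to hide)."""
    return int(x[0] >= 0.5)

def innocuous_model(x):
    """Decides purely on the unrelated feature."""
    return int(x[1] >= 0.5)

def looks_in_distribution(x, reference=real_data, radius=0.1):
    """Stand-in detector (an assumption): is x within `radius` of a real point?"""
    return any(math.dist(x, r) <= radius for r in reference)

def scaffolded_model(x):
    """Biased on realistic inputs, innocuous on perturbation-like inputs."""
    if looks_in_distribution(x):
        return biased_model(x)
    return innocuous_model(x)

# Perturbation-style queries: real points plus Gaussian noise, loosely
# imitating how a perturbation-based explainer probes a model.
perturbed = [
    [v + random.gauss(0.0, 0.5) for v in x]
    for x in real_data
    for _ in range(4)
]

# On real data the scaffold reproduces the biased predictions exactly.
agrees_on_real = all(scaffolded_model(x) == biased_model(x) for x in real_data)

# Most perturbed queries fall off the data manifold and see the innocuous model.
ood_fraction = sum(not looks_in_distribution(p) for p in perturbed) / len(perturbed)

print("matches biased model on real data:", agrees_on_real)
print("share of perturbed queries answered innocuously:", ood_fraction)
```

In this toy setting, an explainer that only queries the scaffold on perturbed points would mostly observe the innocuous model, which is the failure mode the abstract describes. A realistic detector would be a learned classifier trained to separate real from perturbed samples rather than a fixed distance threshold.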

 
