An Adversarial Approach for Explaining the Predictions of Deep Neural Networks

Abstract

Machine learning models have been successfully applied to a wide range of applications including computer vision, natural language processing, and speech recognition. A successful implementation of these models, however, usually relies on deep neural networks (DNNs), which are treated as opaque black-box systems due to their incomprehensible complexity and intricate internal mechanism. In this work, we present a novel algorithm for explaining the predictions of a DNN using adversarial machine learning. Our approach identifies the relative importance of input features in relation to the predictions based on the behavior of an adversarial attack on the DNN. Our algorithm has the advantage of being fast, consistent, and easy to implement and interpret. We present our detailed analysis that demonstrates how the behavior of an adversarial attack, given a DNN and a task, stays consistent for any input test data point, proving the generality of our approach. Our analysis enables us to produce consistent and efficient explanations. We illustrate the effectiveness of our approach by conducting experiments using a variety of DNNs, tasks, and datasets. Finally, we compare our work with other well-known techniques in the current literature.
