On Model Explanations with Transferable Neural Pathways

Abstract

Neural pathways as model explanations consist of a sparse set of neurons that provide the same level of prediction performance as the whole model. Existing methods primarily focus on accuracy and sparsity, but the generated pathways may offer limited interpretability and thus fall short in explaining the model behavior. In this paper, we suggest two interpretability criteria for neural pathways: (i) same-class neural pathways should primarily consist of class-relevant neurons; (ii) each instance's neural pathway sparsity should be optimally determined. To this end, we propose a Generative Class-relevant Neural Pathway (GEN-CNP) model that learns to predict the neural pathways from the target model's feature maps. We propose to learn class-relevant information from features of deep and shallow layers such that same-class neural pathways exhibit high similarity. We further impose a faithfulness criterion for GEN-CNP to generate pathways with instance-specific sparsity. We propose to transfer the class-relevant neural pathways to explain samples of the same class and show experimentally and qualitatively their faithfulness and interpretability.
