Causal Proxy Models for Concept-Based Model Explanations

Abstract

Explainability methods for NLP systems encounter a version of the fundamental problem of causal inference: for a given ground-truth input text, we never truly observe the counterfactual texts necessary for isolating the causal effects of model representations on outputs. In response, many explainability methods make no use of counterfactual texts, assuming they will be unavailable. In this paper, we show that robust causal explainability methods can be created using approximate counterfactuals, which can be written by humans to approximate a specific counterfactual or simply sampled using metadata-guided heuristics. The core of our proposal is the Causal Proxy Model (CPM). A CPM explains a black-box model $\mathcal{N}$ because it is trained to have the same actual input/output behavior as $\mathcal{N}$ while creating neural representations that can be intervened upon to simulate the counterfactual input/output behavior of $\mathcal{N}$. Furthermore, we show that the best CPM for $\mathcal{N}$ performs comparably to $\mathcal{N}$ in making factual predictions, which means that the CPM can simply replace $\mathcal{N}$, leading to more explainable deployed models. Our code is available at https://github.com/frankaging/Causal-Proxy-Model.
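
To make "representations that can be intervened upon" concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that a fixed set of hidden coordinates is aligned with one concept, and that a counterfactual prediction is simulated by copying those coordinates from a second (source) input into the hidden vector of the original (base) input. All names here (`interchange`, `toy_encoder`, `toy_head`, `CONCEPT_DIMS`) are hypothetical and do not come from the paper or its repository.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def interchange(base_hidden: Sequence[float],
                source_hidden: Sequence[float],
                concept_dims: Sequence[int]) -> Vector:
    """Return a copy of base_hidden with the concept coordinates taken from source_hidden."""
    if len(base_hidden) != len(source_hidden):
        raise ValueError("hidden vectors must have the same length")
    mixed = list(base_hidden)
    for i in concept_dims:
        mixed[i] = source_hidden[i]
    return mixed


def toy_encoder(features: Sequence[float]) -> Vector:
    # Hypothetical stand-in for a neural encoder: the hidden state is the input itself.
    return [float(v) for v in features]


def toy_head(hidden: Sequence[float]) -> float:
    # Hypothetical stand-in for an output head: sum of the hidden coordinates.
    return sum(hidden)


def simulate_counterfactual(encoder: Callable[[Sequence[float]], Vector],
                            head: Callable[[Sequence[float]], float],
                            base_input: Sequence[float],
                            source_input: Sequence[float],
                            concept_dims: Sequence[int]) -> float:
    """Predict on the base input after swapping in the source's concept coordinates."""
    base_hidden = encoder(base_input)
    source_hidden = encoder(source_input)
    return head(interchange(base_hidden, source_hidden, concept_dims))


# Assumption: coordinates 0 and 1 are the ones aligned with the concept of interest.
CONCEPT_DIMS = [0, 1]

base = [1.0, 0.0, 2.0, 3.0]
source = [-1.0, 5.0, 0.0, 0.0]

factual = toy_head(toy_encoder(base))
counterfactual = simulate_counterfactual(
    toy_encoder, toy_head, base, source, CONCEPT_DIMS
)
concept_effect = counterfactual - factual  # estimated effect of changing the concept
```

In this toy setting, the difference between the intervened prediction and the factual prediction plays the role of the estimated effect of the concept on the output; a real CPM would use a trained network in place of the stand-in encoder and head.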
