RLLIM: Reinforcement Learning-based
Locally Interpretable Modeling
Abstract
Understanding blackbox machine learning models is important for their widespread adoption. However, developing globally interpretable models that explain the behavior of the entire model is challenging. An alternative approach is to explain blackbox models by explaining individual predictions using a locally interpretable model. In this paper, we propose a novel method for locally interpretable modeling – Reinforcement Learning-based Locally Interpretable Modeling (RLLIM). RLLIM employs reinforcement learning to select a small number of samples and distill the blackbox model prediction into a low-capacity locally interpretable model. Training is guided with a reward that is obtained directly by measuring agreement of the predictions from the locally interpretable model with the blackbox model. RLLIM near-matches the overall prediction performance of blackbox models while yielding human-like interpretability, and significantly outperforms state-of-the-art locally interpretable models in terms of overall prediction performance and fidelity.
Jinsung Yoon 

Department of Electrical and 
Computer Engineering, UCLA, CA 
Sercan Ö. Arık 

Google Cloud AI 
Sunnyvale, CA 
Tomas Pfister 

Google Cloud AI 
Sunnyvale, CA 
1 Introduction
Artificial Intelligence (AI) is advancing at a rapid pace, particularly with recent advances in deep neural networks and ensemble methods (Goodfellow et al., 2016; He et al., 2016; Chen & Guestrin, 2016; Ke et al., 2017). This progress has been fueled by ‘blackbox’ machine learning models where the decision making is controlled by complex nonlinear interactions between many parameters that are difficult for humans to understand and interpret. However, in many real-world applications AI systems are not only expected to perform well but are also required to be interpretable: doctors need to understand why a particular treatment is recommended, and financial institutions need to understand why a loan was declined. Use cases of model interpretability vary across applications: it can provide trust to users by showing rationales behind decisions, enable detection of systematic failure cases, and provide actionable feedback for improving models (Rudin, 2018).
Many studies have suggested a tradeoff between performance and interpretability (Virág & Nyitrai, 2014; Johansson et al., 2011). This is correct in that globally interpretable models, which attempt to explain the entire model behavior, typically yield considerably worse performance than ‘blackbox’ models (Lipton, 2016). To go beyond the performance limitations of globally interpretable models, another promising direction is locally interpretable models, which instead of explaining the entire model explain a single prediction (Ribeiro et al., 2016). Methodologically, while a globally interpretable model fits a single inherently interpretable model (such as a linear model or a shallow decision tree) to the entire training set, locally interpretable models aim to fit an inherently interpretable model locally, i.e. for each instance individually, by distilling knowledge from a high-performance blackbox model. Such locally interpretable models are very useful for real-world AI deployments to provide succinct and human-like explanations to users. They can be used to identify systematic failure cases (e.g. by seeking common trends in input dependence for failure cases), detect biases (e.g. by quantifying feature importance for a particular variable), and provide actionable feedback to improve a model (e.g. understand failure cases and what training data to collect).
To be useful in practice, locally interpretable models need to maximize two objectives: (i) the overall prediction performance (how well the model predicts compared to the ground truth labels), so that the model is accurate, and (ii) fidelity (how well it approximates the ‘blackbox’ model predictions), so that the model reliably approximates the blackbox model’s predictions in the neighborhood of interest (Plumb et al., 2019; Lakkaraju et al., 2019). To this end, a few methods have recently been proposed for locally interpretable modeling: Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016), Supervised Local modeling methods (SILO) (Bloniarz et al., 2016), and Model Agnostic Supervised Local Explanations (MAPLE) (Plumb et al., 2018). LIME in particular has gained notable popularity and has been deployed in many applications due to its simplicity. However, the overall prediction performance and fidelity metrics do not reach desired levels in many cases (Alvarez-Melis & Jaakkola, 2018; Zhang et al., 2019; Ribeiro et al., 2018; Lakkaraju et al., 2017). Indeed, as we show in our experiments, there are frequent cases where existing locally interpretable models underperform even commonly low-performing globally interpretable models.
One of the fundamental challenges in fitting a locally interpretable model is the difference in representational capacity when applying distillation. Blackbox machine learning models, such as deep neural networks or ensemble models, have much larger representational capacity than locally interpretable models. This can result in underfitting with conventional distillation techniques, leading to suboptimal performance (Hinton et al., 2015; Wang et al., 2019). We address this fundamental challenge by proposing a novel Reinforcement Learning-based method to fit Locally Interpretable Models, which we call RLLIM. RLLIM efficiently utilizes the small representational capacity of locally interpretable models by training with a small number of samples that are determined to have the highest value contribution to the fitting of a locally interpretable model. In order to select these highest-value instances, we train instancewise weight estimators (modeled with deep neural networks) using a reinforcement signal that quantifies the fidelity metric (i.e. how well the model approximates the blackbox model predictions). The contributions of this paper can be summarized as:

1.
We introduce the first method that tackles interpretability through data-weighted training, and show that reinforcement learning is highly effective for end-to-end training of such a model.

2.
We show that distillation of a blackbox model into a low-capacity interpretable model can be significantly improved by fitting with a small subset of relevant samples that is controlled efficiently by our method.

3.
On various classification and regression datasets, we demonstrate that RLLIM significantly outperforms alternative models (LIME, SILO and MAPLE) in overall prediction performance and fidelity metrics – in most cases, the overall performance of locally interpretable models obtained by RLLIM is very similar to complex blackbox models.
2 Related Work
Locally interpretable models: There are various approaches to interpret blackbox models; Gilpin et al. (2018) provide a good overview. One approach is to directly decompose the prediction into feature attributions by considering what-if cases. Shapley values (Štrumbelj & Kononenko, 2014) and their computationally efficient variants (Lundberg & Lee, 2017) are commonly used methods in this category. Other notable methods are based on activation differences, e.g. DeepLIFT (Shrikumar et al., 2017), or saliency maps using the gradient flows, e.g. CAM (Zhou et al., 2016) and Grad-CAM (Selvaraju et al., 2017). In this paper, we focus on the direction of locally interpretable modeling – distilling a blackbox model into an interpretable model for each input instance.
Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016) is the most popular method for locally interpretable modeling. LIME is based on modifying a data instance by tweaking the feature values and then learning from the impact of the modifications on the output. A fundamental challenge for LIME is the need for a meaningful distance metric to determine neighborhoods, as simple metrics like Euclidean distance may yield poor fidelity in some cases and the estimation can be highly sensitive to normalization (Alvarez-Melis & Jaakkola, 2018), especially with categorical variables. Supervised Local modeling methods (SILO) (Bloniarz et al., 2016) aims to improve LIME by determining the neighborhoods for each instance using ad-hoc tree-based ensemble methods. Model Agnostic Supervised Local Explanations (MAPLE) (Plumb et al., 2018) further adds a method for feature selection on top of SILO – it utilizes ad-hoc tree-based ensemble methods to determine the weights of training instances for each target instance and uses the weights to optimize a locally interpretable model. However, SILO and MAPLE still have shortcomings because the tree-based ensemble methods are optimized independently from the locally interpretable model – lack of joint optimization results in suboptimal fidelity for the locally interpretable model. Overall, to construct a locally interpretable model, a key problem is how to select the optimal training instances for each testing instance, because the selected training instances mostly determine the constructed locally interpretable model. The number of possibilities for training instance selection is extremely large (exponential in the number of training instances). LIME heuristically utilizes Euclidean distances, whereas SILO and MAPLE use ad-hoc tree-based ensemble methods.
Our proposed method, RLLIM, takes a very different perspective: to properly and efficiently explore the large possible solution space, RLLIM utilizes reinforcement learning to find the optimal policy that selects the training instances that maximize the fidelity of the locally interpretable model.
Data-weighted training: Optimal weighting of training data is a paramount problem in machine learning. By upweighting valuable instances and downweighting low-quality or problematic instances, better performance can be obtained in certain learning scenarios, such as with imbalanced or noisy labels (Jiang et al., 2018). One approach for data weighting is utilizing Influence Functions (Koh & Liang, 2017), which are based on oracle access to gradients and Hessian-vector products. Jointly trained student-teacher methods constitute another approach (Jiang et al., 2018; Bengio et al., 2009) to learn a data-driven curriculum. Using the feedback from the teacher network, training instancewise weights are learned for the student model. Aligned with our motivations, meta learning is considered for data weighting in Ren et al. (2018). Their proposed method utilizes gradient descent-based meta learning, guided by a small validation set, to maximize the target performance.
In this work we consider data-weighted training for a novel purpose: interpretability. Unlike gradient descent-based meta learning, our approach uses reinforcement learning to integrate the reward directly with the fidelity metric. Aforementioned works estimate the same ranking of training instances for the entire dataset. Instead, our method yields an instancewise ranking of training data points, different for each testing instance. This enables efficient distillation of a blackbox model prediction into a locally interpretable model.
3 Reinforcement Learning-based Modeling
We consider a training dataset $\mathcal{D}=\{(\mathbf{x}_{i},y_{i}),i=1,\dots,N\}\sim \mathcal{P}$ for training a blackbox model $f$, where $\mathbf{x}_{i}\in \mathcal{X}$ is the feature vector in a $d$-dimensional feature space $\mathcal{X}$ and $y_{i}\in \mathcal{Y}$ is the corresponding label in a label space $\mathcal{Y}$. We also assume that there exists a probe dataset $\mathcal{D}^{p}=\{(\mathbf{x}_{j}^{p},y_{j}^{p}),j=1,\dots,M\}\sim \mathcal{P}$, where $M$ is the number of probe instances. The probe dataset is used to evaluate the model performance to guide meta learning, as in Ren et al. (2018). If there is no explicit probe dataset, we can randomly hold out a subset of the training dataset as the probe dataset and use the remainder as the training dataset. RLLIM is composed of three models:

1.
Blackbox model $f:\mathcal{X}\to \mathcal{Y}$ – any machine learning model that needs to be explained (e.g. a deep neural network or a decision tree-based ensemble model),

2.
Locally interpretable model ${g}_{\theta}:\mathcal{X}\to \mathcal{Y}$ – an inherently interpretable model by design (e.g. a linear model or a shallow decision tree),

3.
Instancewise weight estimation model ${h}_{\varphi}:\mathcal{X}\times \mathcal{X}\times \mathcal{Y}\to [0,1]$ – a function that outputs the instancewise weights to fit the locally interpretable model. It uses concatenation of a probe feature, a training feature, and a corresponding blackbox model prediction on the training feature as its inputs. It can be a complex machine learning model – e.g. here a deep neural network.
Our objective is to construct an accurate locally interpretable model ${g}_{\theta}$ such that its predictions are similar to the predictions of the given (pretrained) blackbox model ${f}^{*}$ – i.e. the locally interpretable model has high fidelity. We use a loss function $\mathcal{L}:\mathcal{Y}\times \mathcal{Y}\to \mathbb{R}$ to quantify the fidelity of the locally interpretable model (e.g. mean absolute error, where lower is better).
The representational capacity difference between the blackbox model and the locally interpretable model is the bottleneck we aim to address. Ideally, to avoid underfitting, locally interpretable models should be learned with a minimal number of training instances that are most effective in capturing the model behavior. We propose an instancewise weight estimation model ${h}_{\varphi}$ to estimate the probabilities of training instances that should be used for fitting the locally interpretable model. Integrating with the accurate locally interpretable modeling goal, we propose the following objective:
$$\begin{array}{cc}\hfill \underset{{h}_{\varphi}}{\mathrm{min}}& {\mathbb{E}}_{{\text{\mathbf{x}}}^{p}\sim {P}_{X}}\left[\mathcal{L}({f}^{*}({\text{\mathbf{x}}}^{p}),{g}_{\theta ({\text{\mathbf{x}}}^{p})}^{*}({\text{\mathbf{x}}}^{p}))\right]+\lambda {\mathbb{E}}_{{\text{\mathbf{x}}}^{p},\text{\mathbf{x}}\sim {P}_{X}}\left[{h}_{\varphi}({\text{\mathbf{x}}}^{p},\text{\mathbf{x}},{f}^{*}(\text{\mathbf{x}}))\right]\hfill \\ \hfill \text{s.t.}& {g}_{\theta ({\text{\mathbf{x}}}^{p})}^{*}=\mathrm{arg}\underset{{g}_{\theta}}{\mathrm{min}}{\mathbb{E}}_{\text{\mathbf{x}}\sim {P}_{X}}\left[{h}_{\varphi}({\text{\mathbf{x}}}^{p},\text{\mathbf{x}},{f}^{*}(\text{\mathbf{x}}))\times {\mathcal{L}}_{g}({f}^{*}(\text{\mathbf{x}}),{g}_{\theta}(\text{\mathbf{x}}))\right]\hfill \end{array}$$  (1) 
where $\lambda \ge 0$ is a hyperparameter that controls the number of training instances used to fit the locally interpretable model (we study the impact of $\lambda$ on performance in Section 4.2), and $h_{\varphi}(\mathbf{x}^{p},\mathbf{x},f^{*}(\mathbf{x}))$ represents the instancewise weight of each training pair $(\mathbf{x},f^{*}(\mathbf{x}))$ for the probe instance $\mathbf{x}^{p}$. $\mathcal{L}_{g}$ is the loss function used to fit the locally interpretable model, for which we use the mean squared error between predicted values for regression and between logits for classification. $\varphi$ and $\theta$ are the trainable parameters, whereas $f^{*}$ (the pretrained blackbox model) is fixed.
The first term in the objective function ${\mathbb{E}}_{{\text{\mathbf{x}}}^{p}\sim {P}_{X}}\left[\mathcal{L}({f}^{*}({\text{\mathbf{x}}}^{p}),{g}_{\theta ({\text{\mathbf{x}}}^{p})}^{*}({\text{\mathbf{x}}}^{p}))\right]$ represents the local prediction differences between blackbox model and locally interpretable model (referred to as fidelity metric). The second term in the objective function ${\mathbb{E}}_{{\text{\mathbf{x}}}^{p},\text{\mathbf{x}}\sim {P}_{X}}\left[{h}_{\varphi}({\text{\mathbf{x}}}^{p},\text{\mathbf{x}},{f}^{*}(\text{\mathbf{x}}))\right]$ represents the expected number of selected training points to fit the locally interpretable model. Lastly, the constraint ensures that the locally interpretable model is derived from weighted loss function, where weights are the output of the instancewise weight estimator ${h}_{\varphi}$. Our formulation does not assume any constraint on ${g}_{\theta}$ – it could be any inherently interpretable model suitable for the data type of interest. Next, we describe how Eq. (1) can be efficiently addressed with reinforcement learning.
3.1 Training and inference
The RLLIM method, shown in Fig. 1, can be thought of as encompassing 5 stages:

•
Stage 0 – Blackbox model training: This stage is the preliminary stage for RLLIM. Given the training set $\mathcal{D}$, the blackbox model $f$ is trained to minimize a loss function $\mathcal{L}_{f}$ (e.g. mean squared error for regression or cross-entropy for classification), i.e., $f^{*}=\arg\min_{f}\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{f}(f(\mathbf{x}_{i}),y_{i})$. If a pretrained blackbox model is already available, we can skip this stage and use that model as $f^{*}$.

•
Stage 1 – Auxiliary dataset construction: Using the pretrained blackbox model ${f}^{*}$, we create auxiliary training and probe datasets, as $\widehat{\mathcal{D}}=\{({\text{\mathbf{x}}}_{i},{\widehat{y}}_{i}),i=1,\mathrm{\dots},N\}$ (where ${\widehat{y}}_{i}={f}^{*}({\text{\mathbf{x}}}_{i})$) and ${\widehat{\mathcal{D}}}^{p}=\{({\text{\mathbf{x}}}_{j}^{p},{\widehat{y}}_{j}^{p}),j=1,\mathrm{\dots},M\}$ (where ${\widehat{y}}_{j}^{p}={f}^{*}({\text{\mathbf{x}}}_{j}^{p})$), respectively. These auxiliary datasets ($\widehat{\mathcal{D}}$, ${\widehat{\mathcal{D}}}^{p}$) are used for instancewise weight estimation models and locally interpretable model training.

•
Stage 2 – Interpretable baseline training: To improve the stability of the instancewise weight estimator training, a baseline model is observed to be beneficial. As the baseline model ${g}_{b}:\mathcal{X}\to \mathcal{Y}$, we use a globally interpretable model (such as a linear model or shallow decision tree) optimized to replicate the predictions of the blackbox model: ${g}_{b}^{*}=\mathrm{arg}{\mathrm{min}}_{g}\frac{1}{N}{\sum}_{i=1}^{N}\mathcal{L}(g({\text{\mathbf{x}}}_{i}),{\widehat{y}}_{i})$.

•
Stage 3 – Instancewise weight estimator training: We train an instancewise weight estimator using the auxiliary datasets ($\widehat{\mathcal{D}}$, ${\widehat{\mathcal{D}}}^{p}$). To encourage exploration, we consider probabilistic selection, with a sampler block that is based on the output of the instancewise weight estimator – ${h}_{\varphi}({\text{\mathbf{x}}}_{j}^{p},{\text{\mathbf{x}}}_{i},{\widehat{y}}_{i})$ represents the probability that $({\text{\mathbf{x}}}_{i},{\widehat{y}}_{i})$ is selected to train locally interpretable model for the probe instance ${\text{\mathbf{x}}}_{j}^{p}$. Let the binary vector $\text{\mathbf{c}}({\text{\mathbf{x}}}_{j}^{p})\in {\{0,1\}}^{N}$ represent the selection operation, such that $({\text{\mathbf{x}}}_{i},{\widehat{y}}_{i})$ is selected for training locally interpretable model for ${\text{\mathbf{x}}}_{j}^{p}$ when ${c}_{i}({\text{\mathbf{x}}}_{j}^{p})=1$. Correspondingly, ${\rho}_{\varphi}({\text{\mathbf{x}}}^{p})$ is the probability mass function for $\text{\mathbf{c}}({\text{\mathbf{x}}}_{j}^{p})$ given ${h}_{\varphi}(\cdot )$:
$$\rho_{\varphi}(\mathbf{x}_{j}^{p},\mathbf{c}(\mathbf{x}_{j}^{p}))=\prod_{i=1}^{N}\left[h_{\varphi}(\mathbf{x}_{j}^{p},\mathbf{x}_{i},f^{*}(\mathbf{x}_{i}))^{c_{i}(\mathbf{x}_{j}^{p})}\cdot \left(1-h_{\varphi}(\mathbf{x}_{j}^{p},\mathbf{x}_{i},f^{*}(\mathbf{x}_{i}))\right)^{1-c_{i}(\mathbf{x}_{j}^{p})}\right]$$
As the original form of the optimization problem in Eq. (1) is intractable due to the expectation operations, we employ approximations:

–
The sample mean is used as an approximation of the first term of the objective function as $\frac{1}{M}\sum_{j=1}^{M}\mathcal{L}(f^{*}(\mathbf{x}_{j}^{p}),g_{\theta (\mathbf{x}_{j}^{p})}^{*}(\mathbf{x}_{j}^{p}))$.

–
The second term of the objective, which represents the average selection probability, is approximated as the number of selected instances (divided by $N$), i.e. $\|\mathbf{c}(\mathbf{x}_{j}^{p})\|_{1}=\frac{1}{N}\sum_{i=1}^{N}c_{i}(\mathbf{x}_{j}^{p})$.

–
The constraint term is approximated using the sample mean of the training loss as ${g}_{\theta ({\text{\mathbf{x}}}_{j}^{p})}^{*}=\mathrm{arg}{\mathrm{min}}_{{g}_{\theta}}\frac{1}{N}{\sum}_{i=1}^{N}\left[{c}_{i}({\text{\mathbf{x}}}_{j}^{p})\cdot {\mathcal{L}}_{g}({f}^{*}({\text{\mathbf{x}}}_{i}),{g}_{\theta}({\text{\mathbf{x}}}_{i}))\right]$.
The sampler block yields a non-differentiable objective, so we cannot train the instancewise weight estimator using conventional gradient descent-based optimization. There are approximations such as training in expectation (Raffel et al., 2017) or Gumbel-softmax (Jang et al., 2016). Instead, motivated by its many successful applications (Ranzato et al., 2015; Zaremba & Sutskever, 2015; Zhang & Lapata, 2017), we use the REINFORCE algorithm (Williams, 1992), such that the selection action is rewarded by the performance of its impact. The loss function for the instancewise weight estimator $l(\varphi )$ is expressed as:
$$l(\varphi )=\mathbb{E}_{\mathbf{x}_{j}^{p}\sim P_{X}}\left[\mathbb{E}_{\mathbf{c}(\mathbf{x}_{j}^{p})\sim \rho_{\varphi}(\mathbf{x}_{j}^{p},\cdot )}\left[\mathcal{L}(f^{*}(\mathbf{x}_{j}^{p}),g_{\theta (\mathbf{x}_{j}^{p})}^{*}(\mathbf{x}_{j}^{p}))+\lambda \|\mathbf{c}(\mathbf{x}_{j}^{p})\|_{1}\right]\right]$$
To apply the REINFORCE algorithm, we directly compute the gradient $\nabla_{\varphi}\widehat{l}(\varphi )$ as:
$$\nabla_{\varphi}\widehat{l}(\varphi )=\mathbb{E}_{\mathbf{x}_{j}^{p}\sim P_{X}}\left[\mathbb{E}_{\mathbf{c}(\mathbf{x}_{j}^{p})\sim \rho_{\varphi}(\mathbf{x}_{j}^{p},\cdot )}\left[\left(\mathcal{L}(f^{*}(\mathbf{x}_{j}^{p}),g_{\theta (\mathbf{x}_{j}^{p})}^{*}(\mathbf{x}_{j}^{p}))+\lambda \|\mathbf{c}(\mathbf{x}_{j}^{p})\|_{1}\right)\nabla_{\varphi}\log \rho_{\varphi}(\mathbf{x}_{j}^{p},\mathbf{c}(\mathbf{x}_{j}^{p}))\right]\right]$$
Using the gradient $\nabla_{\varphi}\widehat{l}(\varphi )$, we employ the following steps iteratively to update the parameters of the instancewise weight estimator $\varphi$:

1.
Estimate instancewise weights ${w}_{i}({\text{\mathbf{x}}}_{j}^{p})={h}_{\varphi}({\text{\mathbf{x}}}_{j}^{p},{\text{\mathbf{x}}}_{i},{\widehat{y}}_{i})$ and instancewise selection vector ${c}_{i}({\text{\mathbf{x}}}_{j}^{p})\sim \text{Ber}({w}_{i}({\text{\mathbf{x}}}_{j}^{p}))$ for each training and probe instance in a minibatch.

2.
Optimize the locally interpretable model with the selection vector for each probe instance:
$${g}_{\theta ({\text{\mathbf{x}}}_{j}^{p})}^{*}=\mathrm{arg}\underset{{g}_{\theta}}{\mathrm{min}}\sum _{i=1}^{N}\left[{c}_{i}({\text{\mathbf{x}}}_{j}^{p})\cdot {\mathcal{L}}_{g}({f}^{*}({\text{\mathbf{x}}}_{i}),{g}_{\theta}({\text{\mathbf{x}}}_{i}))\right]$$ 
3.
Update the instancewise weight estimation model parameter $\varphi $:
$$\varphi \leftarrow \varphi -\frac{\alpha}{M}\sum_{j=1}^{M}\left[\mathcal{L}(f^{*}(\mathbf{x}_{j}^{p}),g_{\theta (\mathbf{x}_{j}^{p})}^{*}(\mathbf{x}_{j}^{p}))-\mathcal{L}_{b}(\mathbf{x}_{j}^{p})+\lambda \|\mathbf{c}(\mathbf{x}_{j}^{p})\|_{1}\right]\cdot \nabla_{\varphi}\log \rho_{\varphi}(\mathbf{x}_{j}^{p},\mathbf{c}(\mathbf{x}_{j}^{p}))$$
where $\alpha >0$ is a learning rate and ${\mathcal{L}}_{b}({\text{\mathbf{x}}}_{j}^{p})=\mathcal{L}({f}^{*}({\text{\mathbf{x}}}_{j}^{p}),{g}_{b}^{*}({\text{\mathbf{x}}}_{j}^{p}))$ is the baseline loss against which we benchmark the performance improvement. We repeat the steps above until convergence.

•
Stage 4 – Interpretable inference: Unlike when training, we use a fixed instancewise weight estimator (without the sampler and interpretable baseline) and merely fit the locally interpretable model at inference. Given the test instance ${\text{\mathbf{x}}}^{t}$, we obtain the selection probabilities from the instancewise weight estimator, and using these as the weights, we fit the locally interpretable model via weighted optimization. The outputs of the trained interpretable model are the instancewise predictions and the corresponding explanations (e.g., local dynamics of the blackbox model predictions at ${\text{\mathbf{x}}}^{t}$ given by the coefficients of the fitted linear model).
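The Stage 2 to Stage 4 loop above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the deep-network weight estimator is swapped for a single logistic layer over hand-built pair features (`pair_features` is hypothetical) so that $\nabla_{\varphi}\log \rho_{\varphi}$ has the closed form $\sum_i (c_i - w_i)\,\mathbf{z}_i$; the locally interpretable model is a ridge regression with a small `l2` term added for numerical stability; and the fidelity loss is the absolute error. All function names and default values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pair_features(x_probe, X_train, y_hat):
    # Hypothetical inputs to h_phi: |x^p - x_i|, blackbox prediction, bias term.
    diff = np.abs(X_train - x_probe)
    return np.hstack([diff, y_hat[:, None], np.ones((len(X_train), 1))])

def fit_weighted_ridge(X, y, w, l2=1e-3):
    # Weighted ridge fit via the normal equations; l2 keeps the system
    # solvable even when few (or no) instances are selected.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    A = Xb.T @ (w[:, None] * Xb) + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ (w * y))

def predict(theta, x):
    return float(np.append(x, 1.0) @ theta)

def train_weight_estimator(X, y_hat, Xp, yp_hat, lam=0.1, alpha=0.05,
                           n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    phi = np.zeros(X.shape[1] + 2)
    # Stage 2: global interpretable baseline fitted to blackbox predictions.
    theta_b = fit_weighted_ridge(X, y_hat, np.ones(len(X)))
    for _ in range(n_iter):
        grad = np.zeros_like(phi)
        for x_p, y_p in zip(Xp, yp_hat):
            Z = pair_features(x_p, X, y_hat)
            w = sigmoid(Z @ phi)                          # step 1: weights
            c = (rng.random(len(X)) < w).astype(float)    # step 1: c_i ~ Ber(w_i)
            theta = fit_weighted_ridge(X, y_hat, c)       # step 2: local fit
            loss = abs(y_p - predict(theta, x_p))         # fidelity loss
            loss_b = abs(y_p - predict(theta_b, x_p))     # baseline loss
            signal = loss - loss_b + lam * c.mean()
            grad += signal * (Z.T @ (c - w))              # signal * grad log rho
        phi -= alpha * grad / len(Xp)                     # step 3: update phi
    return phi

def explain(phi, X, y_hat, x_test):
    # Stage 4: no sampling; the selection probabilities are used as weights.
    w = sigmoid(pair_features(x_test, X, y_hat) @ phi)
    theta = fit_weighted_ridge(X, y_hat, w)
    return predict(theta, x_test), theta[:-1]
```

Calling `explain(phi, X, y_hat, x_test)` returns the local prediction together with the fitted coefficients, which play the role of the explanation for `x_test`. Refitting the local model inside the inner loop is what makes each iteration cost a full weighted regression per probe instance.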
3.2 Computational cost
In this subsection, we analyze the computational cost of RLLIM for training and inference. As a representative and commonly used example, we assume linear regression as the locally interpretable model, which has a computational complexity of $\mathcal{O}({d}^{2}N)+\mathcal{O}({d}^{3})$ to fit, where $d$ is the number of features and $N$ is the number of training instances. When $N\gg d$ (which is often the case in practice), the training computational complexity is approximated as $\mathcal{O}({d}^{2}N)$ (Tan, 2018).
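To see where the two terms come from, it may help to write the weighted least-squares fit in closed form. This is the standard normal-equations identity rather than something stated in the paper, and $X$, $W$ and $\widehat{\mathbf{y}}$ are notation introduced here for the stacked training features, the diagonal matrix of instancewise weights, and the vector of blackbox predictions:

$$\theta^{*}=\left(X^{\top}WX\right)^{-1}X^{\top}W\widehat{\mathbf{y}},\qquad X\in \mathbb{R}^{N\times d},\quad W=\mathrm{diag}(w_{1},\dots,w_{N})$$

Forming the $d\times d$ matrix $X^{\top}WX$ takes $\mathcal{O}(d^{2}N)$ operations and forming $X^{\top}W\widehat{\mathbf{y}}$ takes $\mathcal{O}(dN)$, while solving the resulting $d\times d$ linear system takes $\mathcal{O}(d^{3})$. The first cost dominates once $N\gg d$.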
Training: Given a pretrained blackbox model, Stage 1 involves running inference $N$ times and the total complexity depends on the complexity of the blackbox model. Unless the blackbox model is very complex, the computational complexity of Stage 1 becomes much smaller than Stage 3. Stage 2 has negligible computational overhead. At Stage 3, we iteratively train the instancewise weight estimator and fit the locally interpretable model from scratch using weighted optimization. Therefore, the computational complexity is $\mathcal{O}({d}^{2}N{N}_{I})$ where ${N}_{I}$ is the number of iterations in Stage 3 (typically $$ until convergence). Thus, the training complexity scales roughly linearly with the number of training instances.
Interpretable inference: To infer with the locally interpretable model, we need to fit the locally interpretable model after obtaining the instancewise weights from the trained instancewise weight estimator. Thus, for each testing instance, the computational complexity is $\mathcal{O}({d}^{2}N)$. (A subset of the training dataset can be used to reduce complexity, with decreased fidelity.)
For instance, on a single NVIDIA V100 GPU, on the Facebook Comment dataset (consisting of $\sim$ 600,000 samples), RLLIM yields a training time of less than 5 hours (including Stages 1, 2 and 3) and an interpretable inference time of less than 10 seconds per testing instance. On the other hand, LIME results in a much longer interpretable inference time (around 30 seconds per testing instance) because it acquires a large number of blackbox model predictions for the input perturbations, whereas SILO and MAPLE are similar to RLLIM.
4 Experiments
We compare RLLIM to multiple benchmarks on 3 synthetic datasets and 5 UCI public datasets. The source code can be found at https://github.com/google-research/google-research/tree/master/rllim.
Datasets: The 3 public datasets for regression problems are: (1) Blog Feedback, (2) Facebook Comment, (3) News Popularity; the other 2 public datasets for classification problems are: (4) Adult Income, (5) Weather. Details of the data descriptions can be found in the hyperlinks of each dataset. Data statistics can be found in Table 3 in Appendix A. In this section, we mainly focus on tabular datasets because the local dynamics are more important and useful to explain for them; however, the RLLIM method can be generalized to other data types in a straightforward way.
Blackbox models: We focus on approximating blackbox models that are shown to yield competitive performance on the target tasks: 3 tree-based ensemble methods (1) XGBoost (Chen & Guestrin, 2016), (2) LightGBM (Ke et al., 2017), (3) Random Forests (RF) (Breiman, 2001); and deep neural networks (4) Multi-layer Perceptron (MLP). Also, we use (5) Ridge Regression (RR) and (6) Regression Tree (RT) (for regression) and (7) Logistic Regression (LR) and (8) Decision Tree (DT) (for classification) as globally interpretable models to benchmark. (We use Python packages, including Sklearn and Tensorflow, to implement those predictive models; the details can be found in the hyperlinks of each model and in Appendix B.) We focus on two types of locally interpretable models: (1) Ridge regression, (2) Shallow regression tree (with a max depth of 3). We report the performance with ridge regression for regression and with shallow regression tree for classification in this section. The results of the other two combinations (with ridge regression for classification and with shallow regression tree for regression) are described in Appendix E.
Comparisons to previous work: We compare the performance of RLLIM with three competing methods: (1) Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016), (2) Supervised Local modeling methods (SILO) (Bloniarz et al., 2016), (3) Model Agnostic Supervised Local Explanations (MAPLE) (Plumb et al., 2018).
Performance metrics: To evaluate the performance of locally interpretable models using real-world datasets, we quantify the overall prediction performance and its fidelity. We assume a disjoint testing dataset ${\mathcal{D}}^{t}={\{(\mathbf{x}_{k}^{t},{y}_{k}^{t})\}}_{k=1}^{L}$ for evaluation. For the overall prediction performance, we compare the predictions of the locally interpretable models with the ground-truth labels. We use Mean Absolute Error (MAE) for regression and Average Precision Recall (APR) for classification. For fidelity, we compare the outputs (predicted values for regression and logits for classification) of the locally interpretable models and of the blackbox model. We consider two metrics: ${R}^{2}$ score (Legates & McCabe, 1999) and Local MAE (LMAE). The details of the metrics are described in Appendix C.
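As a rough illustration of how the regression metrics relate, the sketch below computes MAE against ground-truth labels and the two fidelity metrics against blackbox outputs. The exact definitions are in Appendix C, which is not reproduced here, so the formulas are assumptions: LMAE is taken to be the mean absolute difference between local-model and blackbox outputs, and the fidelity $R^{2}$ is the usual coefficient of determination with the blackbox outputs as the reference. Function names are illustrative.

```python
import numpy as np

def mae(y_true, y_local):
    # Overall prediction performance: local-model predictions vs. labels.
    y_true, y_local = np.asarray(y_true, float), np.asarray(y_local, float)
    return float(np.mean(np.abs(y_true - y_local)))

def lmae(y_blackbox, y_local):
    # Assumed LMAE: local-model outputs vs. blackbox outputs.
    y_blackbox, y_local = np.asarray(y_blackbox, float), np.asarray(y_local, float)
    return float(np.mean(np.abs(y_blackbox - y_local)))

def fidelity_r2(y_blackbox, y_local):
    # Coefficient of determination with the blackbox outputs as the reference.
    y_blackbox, y_local = np.asarray(y_blackbox, float), np.asarray(y_local, float)
    ss_res = np.sum((y_blackbox - y_local) ** 2)
    ss_tot = np.sum((y_blackbox - y_blackbox.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

The only difference between `mae` and `lmae` is the reference vector, which mirrors the distinction between overall prediction performance (ground truth) and fidelity (blackbox output).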
Implementation details: We implement the instancewise weight estimator using a multi-layer perceptron with tanh activation. The number of hidden units and the number of layers are optimized by cross-validation. In most cases, a 5-layer perceptron with 100 hidden units performs reasonably well across all datasets. All features are normalized to be between zero and one, using a standard min-max scaler. Categorical variables are transformed using one-hot encoding.
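A forward pass of such an estimator might look like the sketch below. It is illustrative only: it assumes a scalar blackbox prediction (so the concatenated input has $2d+1$ entries), reads "5-layer" as four tanh hidden layers followed by one sigmoid output layer, and uses a simple scaled Gaussian initialization; none of these choices is confirmed by the text, and the function names are hypothetical.

```python
import numpy as np

def init_weight_estimator(d, hidden=100, n_layers=5, seed=0):
    # Input: concatenation of probe feature, training feature, and the
    # (assumed scalar) blackbox prediction on the training feature.
    rng = np.random.default_rng(seed)
    sizes = [2 * d + 1] + [hidden] * (n_layers - 1) + [1]
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def instance_weights(params, x_probe, X_train, y_hat):
    # Returns one weight in (0, 1) per training instance for this probe.
    n = len(X_train)
    h = np.hstack([np.tile(x_probe, (n, 1)), X_train, y_hat[:, None]])
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)           # tanh hidden layers
    W, b = params[-1]
    logits = (h @ W + b)[:, 0]
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid keeps outputs in (0, 1)
```

The sigmoid output matches the codomain $[0,1]$ of $h_{\varphi}$, so the returned values can be used directly as Bernoulli selection probabilities during training or as regression weights at inference.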
4.1 Experiments on synthetic datasets – Recovering local dynamics
On real-world datasets, it is challenging to directly evaluate the explanation quality of locally interpretable models because ground-truth explanations are absent. We therefore first focus on synthetic datasets (with known ground-truth explanations) to directly evaluate how well the locally interpretable models can recover the underlying local dynamics. We construct three synthetic datasets in which the 11-dimensional input features $\mathbf{X}$ are sampled from $\mathcal{N}(0,I)$ and $Y$ is defined as follows:

1. Syn1: $Y={X}_{1}+2{X}_{2}$ if ${X}_{10}<0$ and $Y={X}_{3}+2{X}_{4}$ if ${X}_{10}\ge 0$

2. Syn2: $Y={X}_{1}+2{X}_{2}$ if ${X}_{10}+{e}^{{X}_{11}}<1$ and $Y={X}_{3}+2{X}_{4}$ if ${X}_{10}+{e}^{{X}_{11}}\ge 1$

3. Syn3: $Y={X}_{1}+2{X}_{2}$ if ${X}_{10}+{X}_{11}^{3}<0$ and $Y={X}_{3}+2{X}_{4}$ if ${X}_{10}+{X}_{11}^{3}\ge 0$
All three datasets have different local dynamics in different input regimes. We directly use the ground truth function as the blackbox model and focus on how well locally interpretable modeling can capture the local dynamics. We evaluate this using the Absolute Weight Difference (AWD): $\text{AWD}=\lVert \mathbf{w}-\widehat{\mathbf{w}}\rVert$, where $\mathbf{w}$ denotes the ground truth coefficients used to generate $Y$ and $\widehat{\mathbf{w}}$ denotes the coefficients derived from the locally interpretable model. We use the estimated coefficients of the ridge regression as the derived local dynamics ($\widehat{\mathbf{w}}$).
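The construction of the three datasets and the AWD metric can be sketched as follows. Two points are assumptions rather than statements from the text: the first regime is taken to be the complement of the stated second condition, and the norm in AWD is taken to be the $\ell_1$ norm. The function names are hypothetical.

```python
import numpy as np

def make_synthetic(name, n, seed=0):
    """Sample X ~ N(0, I) in 11 dimensions and build Y for Syn1, Syn2 or Syn3.

    Returns X, Y and the per-instance ground-truth coefficient matrix W.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, 11))
    x10, x11 = X[:, 9], X[:, 10]  # X_10 and X_11 (1-indexed in the text)
    if name == "Syn1":
        second = x10 >= 0
    elif name == "Syn2":
        second = x10 + np.exp(x11) >= 1
    elif name == "Syn3":
        second = x10 + x11 ** 3 >= 0
    else:
        raise ValueError(f"unknown dataset: {name}")
    W = np.zeros((n, 11))
    # Assumed first regime (complement of the condition): Y = X_1 + 2 X_2.
    W[~second, 0], W[~second, 1] = 1.0, 2.0
    # Second regime: Y = X_3 + 2 X_4.
    W[second, 2], W[second, 3] = 1.0, 2.0
    Y = np.sum(W * X, axis=1)
    return X, Y, W

def awd(w_true, w_hat):
    """Absolute Weight Difference, here computed with the l1 norm (assumption)."""
    return np.abs(np.asarray(w_true) - np.asarray(w_hat)).sum(axis=-1)
```

Calling `awd(W[i], w_hat_i)` for each test instance and averaging gives a single score per method and regime.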
As shown in Fig. 2, RLLIM significantly outperforms the other benchmarks in discovering the local dynamics on all three datasets and in different regimes. RLLIM can actively learn both linear and nonlinear decision boundaries for the local dynamics. Note that LIME completely fails to recover the local dynamics because it applies the Euclidean distance uniformly to all features and cannot distinguish the special properties of the features that alter the local dynamics. SILO and MAPLE use only the predictions to discover the local dynamics; thus, they struggle to discover a decision boundary that depends on other variables that are independent of the predictions. Fig. 5 in Appendix D shows the learning curves of RLLIM, demonstrating the efficiency of reinforcement learning.
4.2 The effect of the number of selected samples on fidelity
In RLLIM, optimal distillation is enabled by using a small subset of training instances to fit the low-capacity locally interpretable model. The number of selected instances is controlled by $\lambda $ in our method: if $\lambda $ is high/low, RLLIM penalizes the number of selected instances more/less; thus, fewer/more instances are selected to construct the locally interpretable model.
We analyze the efficacy of $\lambda $ in controlling the likelihood of selection and the dependency of fidelity on $\lambda $. We expect that selecting too few or too many training instances makes the locally interpretable model overfit or underfit, respectively, which hurts fidelity in both cases. Fig. 3 shows a clear relationship between $\lambda $ and the local fidelity. If $\lambda $ is too large, RLLIM selects too few instances; thus, the fitted locally interpretable model is less accurate (due to overfitting). On the other hand, if $\lambda $ is too small, RLLIM selects too many instances and fidelity deteriorates (due to underfitting). To find the optimal $\lambda $, we conduct cross-validation experiments and select the $\lambda $ that achieves the best validation fidelity (e.g. $\lambda =0.5$ in Syn2). Fig. 3 also shows the average selection probability of the training instances for each $\lambda $. As $\lambda $ increases, the average selection probability decreases monotonically due to the higher penalty on the number of selected training instances. Note that even with a small portion of the training instances, RLLIM can accurately distill the predictions of blackbox models into locally interpretable models, which is crucial for understanding and interpreting the predictions using the most relevant training instances.
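The trade-off controlled by $\lambda$ can be illustrated with a small sketch. It assumes that the penalized objective has the form "fidelity loss at a probe point plus $\lambda$ times the average selection probability"; this form, the absolute-error fidelity loss, and all function names are assumptions made for illustration and are not taken from this section.

```python
import numpy as np

def fit_ridge_on_selection(X, f_train, selected, alpha=1.0):
    """Closed-form ridge fit (with unpenalized intercept) on selected rows."""
    if not selected.any():
        raise ValueError("no training instance was selected")
    Xs = np.column_stack([np.ones(int(selected.sum())), X[selected]])
    reg = alpha * np.eye(Xs.shape[1])
    reg[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(Xs.T @ Xs + reg, Xs.T @ f_train[selected])

def penalized_loss(x_probe, f_probe, X, f_train, probs, lam, rng, alpha=1.0):
    """Fidelity loss at one probe point plus lam * mean selection probability."""
    # Sample a binary selection vector from the per-instance probabilities.
    selected = rng.random(len(probs)) < probs
    coef = fit_ridge_on_selection(X, f_train, selected, alpha)
    g_probe = coef[0] + x_probe @ coef[1:]
    return abs(g_probe - f_probe) + lam * probs.mean()
```

Under this assumed objective, a larger `lam` makes high selection probabilities more costly, so the learned `probs` shrink and fewer instances enter the local fit, which matches the monotone decrease described above.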
4.3 Experiments on real datasets – Overall performance and fidelity
On multiple real datasets, we evaluate the overall prediction performance and fidelity. For the regression and classification problems, we use ridge regression and shallow regression trees, respectively, as the locally interpretable model. More results can be found in Appendix E.
As can be seen in Table 1, the performance of globally interpretable ridge regression (trained on the entire dataset from scratch) is much worse than that of the complex nonlinear models, implying that modeling nonlinear relationships between the features and the labels is important for high prediction performance. For the other locally interpretable modeling methods (LIME, SILO, MAPLE), the performance is far worse than that of the original blackbox model, showing that they fail to efficiently distill the nonlinear blackbox models. In some cases (especially on the Facebook dataset), the performance of the benchmarks is even worse than that of global ridge regression (highlighted in red), which calls into question the value of using these locally interpretable models instead of globally interpretable ridge regression.
In contrast, RLLIM achieves overall prediction performance similar to that of the blackbox models and significantly outperforms global ridge regression. Table 1 also compares the fidelity in terms of ${R}^{2}$ score for regression using ridge regression as the locally interpretable model (LMAE results can be found in Appendix E.3). We observe that the ${R}^{2}$ scores in some cases (especially on the Facebook dataset and for LIME) are negative, which means that the outputs of the locally interpretable models are even worse than those of the constant mean value estimator. RLLIM, on the other hand, consistently achieves higher, positive ${R}^{2}$ values than the other benchmarks for all datasets and blackbox models.
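| Dataset (DT APR) | Model | XGBoost APR | XGBoost ${R}^{2}$ | LightGBM APR | LightGBM ${R}^{2}$ | MLP APR | MLP ${R}^{2}$ | RF APR | RF ${R}^{2}$ |
|---|---|---|---|---|---|---|---|---|---|
| Adult (.6388) | Original | .8096 | 1.0 | .8254 | 1.0 | .7678 | 1.0 | .7621 | 1.0 |
| Adult (.6388) | RLLIM | .8011 | .9889 | .8114 | .9602 | .7710 | .9451 | .7881 | .8788 |
| Adult (.6388) | LIME | .6211* | .5009 | .6031* | .3798 | .4270* | .2511 | .6166* | .3833 |
| Adult (.6388) | SILO | .8001 | .9869 | .8107 | .9583 | .7708 | .9470 | .7833 | .8548 |
| Adult (.6388) | MAPLE | .7928 | .9794 | .8034 | .9405 | .7719 | .9410 | .7861 | .8622 |
| Weather (.5838) | Original | .7133 | 1.0 | .7299 | 1.0 | .7205 | 1.0 | .7274 | 1.0 |
| Weather (.5838) | RLLIM | .7071 | .9734 | .7118 | .9601 | .7099 | .9124 | .7102 | .9008 |
| Weather (.5838) | LIME | .6179 | .7783 | .6159 | .6913 | .5651* | .3417 | .6209 | .3534 |
| Weather (.5838) | SILO | .6991 | .9680 | .7052 | .9452 | .6997 | .8864 | .7042 | .8398 |
| Weather (.5838) | MAPLE | .6973 | .9675 | .7056 | .9446 | .6983 | .8856 | .6983 | .8856 |

The value in parentheses after each dataset is the APR of the globally interpretable decision tree (DT). Values marked with * are highlighted in red: they are worse than this baseline.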
Table 2 shows a similar analysis for classification using shallow regression trees (with a max depth of 3) as the locally interpretable model; the regression trees are used to model logit outputs for classification. The overall prediction performance of the four blackbox models is significantly better than that of the globally interpretable decision tree, which demonstrates the superior fit of complex blackbox models. Among the locally interpretable models, RLLIM achieves the best APR and ${R}^{2}$ score in most cases, underlining its strength in accurately distilling the predictions of the blackbox model. In some cases, the benchmarks (especially LIME) achieve lower overall prediction performance than the globally interpretable decision tree (highlighted in red). The overall prediction performance and fidelity metrics of all locally interpretable models seem better for classification problems than for regression problems. We hypothesize that the predictions of blackbox models are mostly highly confident, i.e. located near 0 or 1; thus, locally interpretable models can distill the predictions of the blackbox models more easily for classification than for regression.
4.4 Qualitative analyses – Interpretations of RLLIM on Adult Income dataset
We qualitatively analyze the local explanations provided by RLLIM on the Adult Income dataset (qualitative analyses on the Weather dataset can be found in Appendix E.4). Although RLLIM is able to provide local explanations for each individual separately, we analyze its explanations at subgroup granularity for better visualization and understanding (instance granularity analyses are described in Appendix E.4). Fig. 4 shows the feature importance (derived by RLLIM as the local explanations) for five subgroups in predicting the annual income, using XGBoost as the blackbox model. We use ridge regression as the locally interpretable model and the absolute value of the fitted coefficients as the estimated feature importance. As can be observed in Fig. 4, for age subgroups, capital gain seems much more important for mature people (older than 25) than for young people (younger than 25). For education subgroups, capital gain/loss, occupation, and native country are more critical for highly educated people (Doctorate, Prof-school, and Masters graduates) than for the others. We do not discover notable biases of the blackbox model for gender, marital status, and race subgroups.
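The subgroup-level importance described above can be sketched as follows. The text only states that importance is the absolute value of the fitted ridge coefficients; averaging those absolute values over the instances of a subgroup is an assumption made here for illustration, and `subgroup_importance` is a hypothetical name.

```python
import numpy as np

def subgroup_importance(local_coefs, group_mask):
    """Mean absolute local coefficient per feature over one subgroup.

    local_coefs: (n_instances, n_features) array, one fitted ridge coefficient
                 vector per explained instance.
    group_mask:  boolean array of length n_instances selecting the subgroup.
    """
    local_coefs = np.asarray(local_coefs, dtype=float)
    group_mask = np.asarray(group_mask, dtype=bool)
    return np.abs(local_coefs[group_mask]).mean(axis=0)
```

Comparing the returned vectors for two complementary masks (for example, age above and below 25) gives the kind of per-feature contrast discussed for Fig. 4.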
5 Conclusions
We propose a novel method for locally interpretable modeling of pretrained blackbox models. Our method employs reinforcement learning to select a small number of valuable instances and uses them to train a low-capacity locally interpretable model. The selection mechanism is guided by a reward obtained from the similarity between the predictions of the locally interpretable model and those of the blackbox model. Our approach nearly matches the performance of blackbox models and significantly and consistently outperforms alternative techniques in terms of overall prediction performance and fidelity metrics across various datasets and blackbox models.
6 Acknowledgements
Discussions with Besim Avci, Henry Tappen and Zizhao Zhang are gratefully acknowledged.
References
 Alvarez-Melis & Jaakkola (2018) David Alvarez-Melis and Tommi S Jaakkola. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049, 2018.
 Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In International Conference on Machine Learning, pp. 41–48. ACM, 2009.
 Bloniarz et al. (2016) Adam Bloniarz, Ameet Talwalkar, Bin Yu, and Christopher Wu. Supervised neighborhoods for distributed nonparametric regression. In Artificial Intelligence and Statistics, pp. 1450–1459, 2016.
 Breiman (2001) Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
 Chen & Guestrin (2016) Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM, 2016.
 Gilpin et al. (2018) L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pp. 80–89, Oct 2018.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, 2016.
 Jiang et al. (2018) Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2309–2318, 2018.
 Johansson et al. (2011) Ulf Johansson, Cecilia Sönströd, Ulf Norinder, and Henrik Boström. Trade-off between accuracy and interpretability for predictive in silico modeling. Future medicinal chemistry, 3(6):647–663, 2011.
 Ke et al. (2017) Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154, 2017.
 Koh & Liang (2017) Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pp. 1885–1894, 2017.
 Lakkaraju et al. (2017) Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Jure Leskovec. Interpretable & explorable approximations of black box models. arXiv preprint arXiv:1707.01154, 2017.
 Lakkaraju et al. (2019) Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Jure Leskovec. Faithful and customizable explanations of black box models. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 131–138. ACM, 2019.
 Legates & McCabe (1999) David R Legates and Gregory J McCabe. Evaluating the use of “goodness-of-fit” measures in hydrologic and hydroclimatic model validation. Water Resources Research, 35(1):233–241, 1999.
 Lipton (2016) Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
 Lundberg & Lee (2017) Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774, 2017.
 Plumb et al. (2018) Gregory Plumb, Denali Molitor, and Ameet S Talwalkar. Model agnostic supervised local explanations. In Advances in Neural Information Processing Systems, pp. 2515–2524, 2018.
 Plumb et al. (2019) Gregory Plumb, Maruan Al-Shedivat, Eric Xing, and Ameet Talwalkar. Regularizing black-box models for improved interpretability. arXiv preprint arXiv:1902.06787, 2019.
 Raffel et al. (2017) Colin Raffel, Minh-Thang Luong, Peter J Liu, Ron J Weiss, and Douglas Eck. Online and linear-time attention by enforcing monotonic alignments. In International Conference on Machine Learning, pp. 2837–2846, 2017.
 Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
 Ren et al. (2018) Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pp. 4331–4340, 2018.
 Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.
 Ribeiro et al. (2018) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic explanations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Rudin (2018) Cynthia Rudin. Please Stop Explaining Black Box Models for High Stakes Decisions. arXiv:1811.10154, 2018.
 Selvaraju et al. (2017) Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626, 2017.
 Shrikumar et al. (2017) Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning, pp. 3145–3153, 2017.
 Štrumbelj & Kononenko (2014) Erik Štrumbelj and Igor Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647–665, 2014.
 Tan (2018) Pang-Ning Tan. Introduction to Data Mining. Pearson Education India, 2018.
 Virág & Nyitrai (2014) Miklós Virág and Tamás Nyitrai. Is there a trade-off between the predictive power and the interpretability of bankruptcy models? the case of the first hungarian bankruptcy prediction model. Acta Oeconomica, 64(4):419–440, 2014.
 Wang et al. (2019) Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A. Efros. Dataset distillation, 2019. URL https://openreview.net/forum?id=Sy4lojC9tm.
 Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
 Zaremba & Sutskever (2015) Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines-revised. arXiv preprint arXiv:1505.00521, 2015.
 Zhang & Lapata (2017) Xingxing Zhang and Mirella Lapata. Sentence simplification with deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 584–594, 2017.
 Zhang et al. (2019) Yujia Zhang, Kuangyan Song, Yiming Sun, Sarah Tan, and Madeleine Udell. “why should you trust my explanation?” understanding uncertainty in lime explanations. arXiv preprint arXiv:1904.12991, 2019.
 Zhou et al. (2016) Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
Appendix A Data statistics
| Problem | Data name | # of samples | # of features | Label distribution |
|---|---|---|---|---|
| Regression | Blog | 60,021 | 280 | 6.6 (0-0-22) |
| Regression | Facebook | 603,713 | 54 | 7.2 (0-0-30) |
| Regression | News | 39,644 | 59 | 3395.4 (584-1400-10800) |
| Classification | Adult | 48,842 | 108 | 11,687 (23.9%) |
| Classification | Weather | 112,925 | 61 | 25,019 (22.2%) |
Appendix B Hyperparameters of the predictive models
In this paper, we use 8 different predictive models. For each predictive model, the corresponding hyperparameters used in the experiments are as follows:

• XGBoost: booster = gbtree, max depth = 6, learning rate = 0.3, number of estimators = 1000, reg alpha = 0

• LightGBM: booster = gbdt, max depth = None, learning rate = 0.1, number of estimators = 1000, min data in leaf = 20

• Random Forests: number of estimators = 1000, criterion = gini, max depth = None, warm start = False

• Multilayer Perceptron: number of layers = 4, hidden units = [feature dimensions, feature dimensions/2, feature dimensions/4, feature dimensions/8], activation function = relu, early stopping = True with patience 10, batch size = 256, maximum number of epochs = 200, optimizer = Adam

• Ridge Regression: alpha = 1

• Regression Tree: max depth = 3, criterion = gini

• Logistic Regression: solver = lbfgs, no regularization

• Decision Tree: max depth = 3, criterion = gini
We follow the default settings for the other hyperparameters that are not mentioned here.
Appendix C Performance metrics

• Mean Absolute Error (MAE):
$$\text{MAE}={\mathbb{E}}_{(\mathbf{x}^{t},y^{t})\sim \mathcal{P}}\left\lVert g_{\theta (\mathbf{x}^{t})}^{*}(\mathbf{x}^{t})-y^{t}\right\rVert_{1}\simeq \frac{1}{L}\sum_{k=1}^{L}\left\lVert g_{\theta (\mathbf{x}_{k}^{t})}^{*}(\mathbf{x}_{k}^{t})-y_{k}^{t}\right\rVert_{1},$$
• Local MAE (LMAE):
$$\text{LMAE}={\mathbb{E}}_{\mathbf{x}^{t}\sim {\mathcal{P}}_{X}}\left\lVert g_{\theta (\mathbf{x}^{t})}^{*}(\mathbf{x}^{t})-f^{*}(\mathbf{x}^{t})\right\rVert_{1}\simeq \frac{1}{L}\sum_{k=1}^{L}\left\lVert g_{\theta (\mathbf{x}_{k}^{t})}^{*}(\mathbf{x}_{k}^{t})-f^{*}(\mathbf{x}_{k}^{t})\right\rVert_{1},$$
• ${R}^{2}$ score (Legates & McCabe, 1999):
$${R}^{2}=1-\frac{{\mathbb{E}}_{\mathbf{x}^{t}\sim {\mathcal{P}}_{X}}\left\lVert f^{*}(\mathbf{x}^{t})-g_{\theta (\mathbf{x}^{t})}^{*}(\mathbf{x}^{t})\right\rVert_{2}^{2}}{{\mathbb{E}}_{\mathbf{x}^{t}\sim {\mathcal{P}}_{X}}\left\lVert f^{*}(\mathbf{x}^{t})-{\mathbb{E}}_{\widehat{\mathbf{x}}^{t}\sim {\mathcal{P}}_{X}}[f^{*}(\widehat{\mathbf{x}}^{t})]\right\rVert_{2}^{2}}\simeq 1-\frac{\frac{1}{L}\sum_{k=1}^{L}\left\lVert f^{*}(\mathbf{x}_{k}^{t})-g_{\theta (\mathbf{x}_{k}^{t})}^{*}(\mathbf{x}_{k}^{t})\right\rVert_{2}^{2}}{\frac{1}{L}\sum_{k=1}^{L}\left\lVert f^{*}(\mathbf{x}_{k}^{t})-\frac{1}{L}\sum_{j=1}^{L}f^{*}(\mathbf{x}_{j}^{t})\right\rVert_{2}^{2}}.$$
If ${R}^{2}=1$, the predictions of the locally interpretable model perfectly match the predictions of the blackbox model. If ${R}^{2}=0$, the locally interpretable model performs on par with the constant mean value estimator. If ${R}^{2}<0$, the locally interpretable model performs worse than the constant mean value estimator.
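The empirical versions of the three metrics can be written compactly for scalar outputs. This is a minimal sketch: `g_pred` stands for the locally interpretable model's outputs on the $L$ test points, `f_pred` for the blackbox outputs, and `y_true` for the labels; these argument names are illustrative and not taken from the paper.

```python
import numpy as np

def mae(g_pred, y_true):
    """Overall performance: mean absolute error against ground-truth labels."""
    return np.mean(np.abs(np.asarray(g_pred) - np.asarray(y_true)))

def lmae(g_pred, f_pred):
    """Fidelity: mean absolute error against the blackbox model's outputs."""
    return np.mean(np.abs(np.asarray(g_pred) - np.asarray(f_pred)))

def r2_fidelity(g_pred, f_pred):
    """Fidelity: 1 - (squared error of g) / (squared error of the mean of f)."""
    g_pred, f_pred = np.asarray(g_pred, float), np.asarray(f_pred, float)
    residual = np.mean((f_pred - g_pred) ** 2)
    baseline = np.mean((f_pred - f_pred.mean()) ** 2)
    return 1.0 - residual / baseline
```

For example, passing the blackbox outputs as both arguments of `r2_fidelity` gives 1, and passing a constant vector equal to their mean gives 0, matching the interpretation above.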
Appendix D Learning curves of RLLIM
Appendix E Additional results
E.1 Regression with shallow regression tree as the locally interpretable model
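| Dataset (RT MAE) | Model | XGBoost MAE | XGBoost ${R}^{2}$ | LightGBM MAE | LightGBM ${R}^{2}$ | MLP MAE | MLP ${R}^{2}$ | RF MAE | RF ${R}^{2}$ |
|---|---|---|---|---|---|---|---|---|---|
| Blog (5.955) | Original | 5.131 | 1.0 | 4.965 | 1.0 | 4.939 | 1.0 | 5.203 | 1.0 |
| Blog (5.955) | RLLIM | 5.121 | .8242 | 4.778 | .8939 | 4.587 | .6375 | 4.652 | .8990 |
| Blog (5.955) | LIME | 11.80* | .2658 | 13.22* | .1483 | 7.396* | -.6201* | 19.61* | -.4116* |
| Blog (5.955) | SILO | 5.149 | .8035 | 4.818 | .8816 | 4.649 | .6177 | 4.715 | .8774 |
| Blog (5.955) | MAPLE | 5.329 | .7991 | 5.024 | .8660 | 4.609 | .6339 | 5.016 | .8201 |
| Facebook (22.28) | Original | 24.18 | 1.0 | 20.22 | 1.0 | 18.36 | 1.0 | 30.09 | 1.0 |
| Facebook (22.28) | RLLIM | 21.82 | .9307 | 21.35 | .9194 | 18.56 | .8832 | 22.44* | .7236 |
| Facebook (22.28) | LIME | 36.69* | .3278 | 44.21* | .1809 | 40.85* | -.1513* | 51.70* | .2301 |
| Facebook (22.28) | SILO | 22.42* | .8655 | 22.33* | .7235 | 19.57 | .8566 | 24.41* | .6917 |
| Facebook (22.28) | MAPLE | 22.15 | .8824 | 23.43* | .8581 | 20.32 | .8035 | 27.12* | .3134 |
| News (3093) | Original | 2995 | 1.0 | 3140 | 1.0 | 2255 | 1.0 | 3378 | 1.0 |
| News (3093) | RLLIM | 2938 | .9382 | 2504 | .4104 | 2226 | .9016 | 2431 | .2768 |
| News (3093) | LIME | 6272* | -.6267* | 7737* | -2.960* | 2390 | .0013 | 9637* | -7.075* |
| News (3093) | SILO | 2910 | .1020 | 2854 | .3461 | 2274 | .8201 | 2874 | .2278 |
| News (3093) | MAPLE | 2968 | .9288 | 2846 | .3631 | 2284 | .8021 | 2888 | .1872 |

The value in parentheses after each dataset is the MAE of the globally interpretable regression tree (RT). Values marked with * are highlighted in red: an MAE worse than this baseline, or a negative ${R}^{2}$.
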
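| Dataset | Model | XGBoost | LightGBM | MLP | RF |
|---|---|---|---|---|---|
| Blog | RLLIM | .7530 | 1.358 | 1.273 | 1.413 |
| Blog | LIME | 9.160 | 11.16 | 5.006 | 17.461 |
| Blog | SILO | .8325 | 1.379 | 1.178 | 1.934 |
| Blog | MAPLE | 1.029 | 1.598 | 1.359 | 2.158 |
| Facebook | RLLIM | 7.240 | 6.867 | 5.596 | 15.77 |
| Facebook | LIME | 31.52 | 37.75 | 30.58 | 45.58 |
| Facebook | SILO | 8.459 | 9.149 | 6.997 | 18.63 |
| Facebook | MAPLE | 7.985 | 8.644 | 7.290 | 23.17 |
| News | RLLIM | 389.0 | 1072 | 116.6 | 957.1 |
| News | LIME | 4455 | 6243 | 504.0 | 9969 |
| News | SILO | 496.7 | 1214 | 160.6 | 1175 |
| News | MAPLE | 440.7 | 1201 | 163.6 | 1196 |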
E.2 Classification with ridge regression as the locally interpretable model
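| Dataset (LR APR) | Model | XGBoost APR | XGBoost ${R}^{2}$ | LightGBM APR | LightGBM ${R}^{2}$ | MLP APR | MLP ${R}^{2}$ | RF APR | RF ${R}^{2}$ |
|---|---|---|---|---|---|---|---|---|---|
| Adult (.7553) | Original | .8096 | 1.0 | .8254 | 1.0 | .7678 | 1.0 | .7621 | 1.0 |
| Adult (.7553) | RLLIM | .7977 | .9871 | .8039 | .9439 | .7670 | .9791 | .7977 | .9217 |
| Adult (.7553) | LIME | .6803* | .7195 | .6805* | .6259 | .6957* | .8310 | .7057* | .6759 |
| Adult (.7553) | SILO | .7912 | .9750 | .7884 | .9301 | .7655 | .9778 | .7664 | .9140 |
| Adult (.7553) | MAPLE | .7947 | .9840 | .8011 | .9386 | .7683 | .9636 | .7958 | .8961 |
| Weather (.7009) | Original | .7133 | 1.0 | .7299 | 1.0 | .7205 | 1.0 | .7274 | 1.0 |
| Weather (.7009) | RLLIM | .7140 | .9879 | .7290 | .9801 | .7212 | .9755 | .7331 | .9450 |
| Weather (.7009) | LIME | .6376* | .7898 | .6392* | .6873 | .6395* | .5321 | .6387* | .4513 |
| Weather (.7009) | SILO | .7134 | .9888 | .7281 | .9773 | .7220 | .9797 | .7277 | .9024 |
| Weather (.7009) | MAPLE | .7134 | .9897 | .7273 | .9778 | .7213 | .9702 | .7308 | .9323 |

The value in parentheses after each dataset is the APR of the globally interpretable logistic regression (LR). Values marked with * are highlighted in red: they are worse than this baseline.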
E.3 Regression with ridge regression as the locally interpretable model  Fidelity analysis in terms of Local MAE (LMAE)
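| Dataset | Model | XGBoost | LightGBM | MLP | RF |
|---|---|---|---|---|---|
| Blog | RLLIM | .8679 | 1.135 | 1.432 | 1.651 |
| Blog | LIME | 6.534 | 8.037 | 8.207 | 17.01 |
| Blog | SILO | 2.220 | 3.046 | 2.393 | 3.909 |
| Blog | MAPLE | .9690 | 1.416 | 1.550 | 1.984 |
| Facebook | RLLIM | 6.394 | 21.29 | 8.217 | 33.64 |
| Facebook | LIME | 32.57 | 33.70 | 27.38 | 48.03 |
| Facebook | SILO | 19.51 | 30.07 | 11.52 | 40.14 |
| Facebook | MAPLE | 7.664 | 31.25 | 13.31 | 44.38 |
| News | RLLIM | 436.9 | 1049 | 74.11 | 905.8 |
| News | LIME | 3317 | 4766 | 327.4 | 8828 |
| News | SILO | 657.2 | 1253 | 79.85 | 1345 |
| News | MAPLE | 500.5 | 1261 | 88.19 | 1157 |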
E.4 Qualitative analyses – Interpretations of RLLIM on Weather dataset
We qualitatively analyze the local explanations provided by RLLIM on the Weather dataset at subgroup granularity. Fig. 6 shows the feature importance for six subgroups in predicting whether it will rain tomorrow, using XGBoost as the blackbox model. We use ridge regression as the locally interpretable model and the absolute value of the fitted coefficients as the estimated feature importance. For rain fall subgroups, humidity and wind gust speed seem more important for heavy rain (rain fall $\ge $ 5) than for light rain (rain fall $<$ 5). For temperature subgroups, rain fall, wind gust speed and humidity are more important for cold days (temperature (at 3pm) $\le $ 10) than for warm days (temperature (at 3pm) $\ge $ 20). In general, humidity, wind gust speed and rain fall are more critical for predicting whether it will rain tomorrow in the heavy rain, fast wind speed, low pressure, and low temperature subgroups than in the light rain, slow wind speed, high pressure, and high temperature subgroups. We do not discover notable biases of the blackbox model for humidity subgroups.
We further analyze the local explanations provided by RLLIM on the Weather dataset at instance granularity. Fig. 6 shows the feature importance (derived by RLLIM as the local explanations) for 10 instances belonging to a subgroup with ‘rain fall $\le $ 1, wind speed (at 3pm) $\ge $ 5, and temperature (at 3pm) $>$ 30’ and for another 10 instances belonging to a subgroup with ‘rain fall $>$ 15, wind speed (at 3pm) $>$ 25, and temperature (at 3pm) $<$ 10’. The other experimental settings are the same as in the subgroup granularity analyses. There are clear differences between the feature importance of the two subgroups (left and right in Fig. 6). Even within the same subgroup, we can observe differences in feature importance across instances, which RLLIM provides efficiently.