Abstract
Understanding black-box machine learning models is important for their widespread adoption. However, developing globally interpretable models that explain the behavior of the entire model is challenging. An alternative approach is to explain black-box models by explaining individual predictions with a locally interpretable model. In this paper, we propose a novel method for locally interpretable modeling: Reinforcement Learning-based Locally Interpretable Modeling (RL-LIM). RL-LIM employs reinforcement learning to select a small number of samples and distill the black-box model prediction into a low-capacity locally interpretable model. Training is guided by a reward obtained directly by measuring the agreement between the predictions of the locally interpretable model and those of the black-box model. RL-LIM near-matches the overall prediction performance of black-box models while yielding human-like interpretability, and significantly outperforms state-of-the-art locally interpretable models in terms of overall prediction performance and fidelity.
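
The fidelity signal described above can be illustrated with a minimal sketch. It is not the RL-LIM implementation: the reinforcement-learned sample selector is replaced here by a fixed Gaussian distance kernel, and `black_box`, `fit_local_linear`, `local_fidelity_reward`, the kernel bandwidth, and the synthetic data are all hypothetical choices made for illustration. The sketch only shows how a low-capacity linear model distilled from black-box predictions on selected samples can be scored by its agreement with the black-box model at a query point.

```python
import numpy as np

def black_box(X):
    # Hypothetical stand-in for a trained black-box model (nonlinear in X).
    return np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

def fit_local_linear(X, y, weights, ridge=1e-6):
    # Weighted ridge regression with an intercept: the low-capacity surrogate.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    A = Xb.T @ (weights[:, None] * Xb) + ridge * np.eye(Xb.shape[1])
    b = Xb.T @ (weights * y)
    return np.linalg.solve(A, b)

def predict_linear(coef, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return Xb @ coef

def local_fidelity_reward(X_train, x_query, weights):
    # Distill black-box predictions into the surrogate, then score agreement
    # at the query point (negative absolute difference; higher is better).
    y_bb = black_box(X_train)
    coef = fit_local_linear(X_train, y_bb, weights)
    pred_local = predict_linear(coef, x_query[None, :])[0]
    pred_bb = black_box(x_query[None, :])[0]
    return -abs(pred_local - pred_bb), coef

rng = np.random.default_rng(0)
X_train = rng.uniform(-2.0, 2.0, size=(500, 2))
x_query = np.array([1.0, -0.5])

# Assumption: a fixed distance kernel stands in for the learned selection.
dist = np.linalg.norm(X_train - x_query, axis=1)
local_w = np.exp(-(dist / 0.3) ** 2)
uniform_w = np.ones(len(X_train))  # no selection: one global linear fit

reward_local, coef_local = local_fidelity_reward(X_train, x_query, local_w)
reward_global, _ = local_fidelity_reward(X_train, x_query, uniform_w)
print("reward with local selection:", reward_local)
print("reward with uniform weights:", reward_global)
```

In this toy setting, weighting samples near the query point should give a higher reward than fitting one linear model to all samples, which is the kind of signal a learned selector could be trained to maximize.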