Manipulating and Measuring Model Interpretability

Abstract

Despite a growing body of research focused on creating interpretable machine learning methods, there have been few empirical studies verifying whether interpretable methods achieve their intended effects on end users. We present a framework for assessing the effects of model interpretability on users via pre-registered experiments in which participants are shown functionally identical models that vary in factors thought to influence interpretability. Using this framework, we ran a sequence of large-scale randomized experiments, varying two putative drivers of interpretability: the number of features and the model transparency (clear or black-box). We measured how these factors impact trust in model predictions, the ability to simulate a model, and the ability to detect a model's mistakes. We found that participants who were shown a clear model with a small number of features were better able to simulate the model's predictions. However, we found no difference in multiple measures of trust and found that clear models did not improve the ability to correct mistakes. These findings suggest that interpretability research could benefit from more emphasis on empirically verifying that interpretable models achieve all their intended effects.
