Abstract
Model interpretability is a crucial problem in neural networks. Existing interpretation methods highlight salient input features, often determining each feature's importance based on gradient information from the model. We instead remove the least influential words, one at a time, from language inputs. This exposes pathological model behavior on language tasks: models produce high confidence values for reduced inputs, even when humans find them nonsensical. We examine the reasons for this behavior and suggest methods of mitigation. Our results have implications for gradient-based interpretation methods, showing that determining word importance using a model's gradient often does not align with humans' perceived importance of that word. We propose a simple entropy regularization technique that mitigates these issues without affecting performance on clean examples.
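
The word-removal procedure summarized above can be illustrated with a minimal sketch. The abstract does not say how a word's influence is scored during removal, so this sketch assumes a leave-one-out score: the least influential word is the one whose deletion leaves the highest confidence in the original prediction. The names `toy_model`, `predict`, and `reduce_input` are hypothetical, and `toy_model` is a hand-written stand-in rather than a trained network.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical type: maps a token list to a probability distribution over labels.
Model = Callable[[Sequence[str]], List[float]]


def toy_model(tokens: Sequence[str]) -> List[float]:
    """Stand-in classifier (an assumption, not a trained network).

    Returns [p_negative, p_positive] and leans positive when "good" appears.
    """
    p_pos = 0.9 if "good" in tokens else 0.4
    return [1.0 - p_pos, p_pos]


def predict(model: Model, tokens: Sequence[str]) -> Tuple[int, float]:
    """Return the predicted label index and the model's confidence in it."""
    probs = model(tokens)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


def reduce_input(model: Model, tokens: Sequence[str]) -> List[str]:
    """Greedily delete one word at a time while the prediction stays the same.

    Influence is approximated by leave-one-out confidence (an assumption):
    the word whose removal keeps the original label most confident is
    treated as the least influential and is removed first.
    """
    current = list(tokens)
    original_label, _ = predict(model, current)
    while len(current) > 1:
        best_candidate: Optional[List[str]] = None
        best_conf = -1.0
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            label, conf = predict(model, candidate)
            if label != original_label:
                continue  # removing this word would flip the prediction
            if conf > best_conf:
                best_candidate, best_conf = candidate, conf
        if best_candidate is None:
            break  # every remaining removal changes the prediction
        current = best_candidate
    return current


tokens = "the movie was really good".split()
reduced = reduce_input(toy_model, tokens)
print(reduced, predict(toy_model, reduced))
```

With this toy model the input shrinks to a single word while the predicted label and confidence stay unchanged, which mirrors in miniature the high-confidence reduced inputs the abstract describes.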