Pathologies of Neural Models Make Interpretations Difficult

  • 2018-08-14 17:01:55
  • Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, Jordan Boyd-Graber

Abstract

Model interpretability is a crucial problem in neural networks. Existing interpretation methods highlight salient input features, often determining each feature's importance based on gradient information from the model. We instead remove the least influential words, one at a time, from language inputs. This exposes pathological model behavior on language tasks: models produce high confidence values for reduced inputs, even when humans find them nonsensical. We examine the reasons for this behavior and suggest methods of mitigation. Our results have implications for gradient-based interpretation methods, showing that determining word importance using a model's gradient often does not align with humans' perceived importance of that word. We propose a simple entropy regularization technique that mitigates these issues without affecting performance on clean examples.
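
The word-removal procedure described in the abstract can be illustrated with a minimal sketch. Several things here are assumptions, not details taken from the abstract: word influence is measured by leave-one-out confidence rather than by gradients, removal stops once any further deletion would change the predicted label, and `input_reduction`, `toy_predict`, and `argmax` are hypothetical names introduced only for this illustration.

```python
import math
from typing import Callable, List, Sequence


def argmax(probs: Sequence[float]) -> int:
    """Index of the largest probability (first index wins on ties)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def input_reduction(
    words: List[str],
    predict: Callable[[List[str]], Sequence[float]],
) -> List[str]:
    """Greedily drop the least influential word while the label is unchanged.

    Assumption: "least influential" means the word whose removal leaves the
    confidence in the original label highest (a leave-one-out proxy).
    """
    original_label = argmax(predict(words))
    current = list(words)
    while len(current) > 1:
        best_candidate = None
        best_confidence = -1.0
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            probs = predict(candidate)
            if argmax(probs) != original_label:
                continue  # removing this word would flip the prediction
            if probs[original_label] > best_confidence:
                best_confidence = probs[original_label]
                best_candidate = candidate
        if best_candidate is None:
            break  # every remaining word is needed to keep the label
        current = best_candidate
    return current


def toy_predict(words: List[str]) -> List[float]:
    """Hypothetical two-class model: counts 'good' against 'bad'."""
    score = sum(w == "good" for w in words) - sum(w == "bad" for w in words)
    p_positive = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p_positive, p_positive]


reduced = input_reduction(["the", "movie", "was", "good"], toy_predict)
print(reduced)
```

With this toy classifier the input shrinks to the single word `good` while the predicted label stays the same. That is benign for such a simple model, but it shows the mechanism by which a real model can remain confident on an input a human would find nonsensical.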

 
