Explaining Prediction Uncertainty of Pre-trained Language Models by Detecting Uncertain Words in Inputs

Abstract

Estimating the predictive uncertainty of pre-trained language models is important for increasing their trustworthiness in NLP. Although many previous works focus on quantifying prediction uncertainty, there is little work on explaining it. This paper takes a step further toward explaining the uncertain predictions of post-calibrated pre-trained language models. We adapt two perturbation-based post-hoc interpretation methods, Leave-one-out and Sampling Shapley, to identify the words in inputs that cause uncertainty in predictions. We test the proposed methods on BERT and RoBERTa with three tasks (sentiment classification, natural language inference, and paraphrase identification) in both in-domain and out-of-domain settings. Experiments show that both methods consistently capture the words in inputs that cause prediction uncertainty.
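
The two perturbation schemes can be illustrated with a minimal sketch. It assumes that uncertainty is measured as the predictive entropy of the model's output distribution; this is an illustrative assumption, not necessarily the paper's exact formulation. The toy lexicon model and the names `WEIGHTS`, `BIAS`, `predict_proba`, `entropy`, `leave_one_out`, and `sampling_shapley` are hypothetical stand-ins for a calibrated BERT or RoBERTa classifier. Leave-one-out scores a word by how much the entropy drops when that word is deleted from the full input. Sampling Shapley averages, over random word orderings, the change in entropy when the word is added to the words that precede it.

```python
import math
import random

# Hypothetical toy "model" standing in for a calibrated classifier:
# each known word shifts the logit of the positive class; unknown words add 0.
WEIGHTS = {"great": 1.5, "boring": -2.5}
BIAS = 1.0


def predict_proba(words):
    """Return P(positive) for a list of words under the toy model."""
    logit = BIAS + sum(WEIGHTS.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-logit))


def entropy(words):
    """Predictive entropy (in nats) of the binary output distribution."""
    p = predict_proba(words)
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)


def leave_one_out(words):
    """Score each position by how much uncertainty drops when it is removed."""
    full = entropy(words)
    return [full - entropy(words[:i] + words[i + 1:]) for i in range(len(words))]


def sampling_shapley(words, num_samples=200, seed=0):
    """Monte Carlo Shapley values of each position w.r.t. predictive entropy."""
    rng = random.Random(seed)
    n = len(words)
    totals = [0.0] * n
    for _ in range(num_samples):
        order = list(range(n))
        rng.shuffle(order)
        present = []         # positions added so far in this permutation
        prev = entropy([])   # uncertainty of the empty input
        for i in order:
            present.append(i)
            # Rebuild the partial input in the original word order.
            cur = entropy([words[j] for j in sorted(present)])
            totals[i] += cur - prev  # marginal contribution of position i
            prev = cur
    return [t / num_samples for t in totals]


sentence = "great movie but boring".split()
for w, loo, sv in zip(sentence, leave_one_out(sentence), sampling_shapley(sentence)):
    print(f"{w:>8}  LOO={loo:+.3f}  Shapley={sv:+.3f}")
```

In this toy input the two sentiment words pull the logit in opposite directions, so Leave-one-out gives positive scores to both "great" and "boring" and zero to the neutral words. Because each sampled permutation telescopes, the Shapley estimates always sum to the entropy of the full input minus that of the empty input; since they average over many partial contexts rather than only the full one, the two methods may rank words differently.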
