Perturbation Sensitivity Analysis to Detect Unintended Model Biases

Abstract

Data-driven statistical Natural Language Processing (NLP) techniques leverage large amounts of language data to build models that can understand language. However, most language data reflect the public discourse at the time the data was produced, and hence NLP models are susceptible to learning incidental associations around named referents at a particular point in time, in addition to general linguistic meaning. An NLP system designed to model notions such as sentiment and toxicity should ideally produce scores that are independent of the identity of such entities mentioned in text and their social associations. For example, in a general purpose sentiment analysis system, a phrase such as "I hate Katy Perry" should be interpreted as having the same sentiment as "I hate Taylor Swift". Based on this idea, we propose a generic evaluation framework, Perturbation Sensitivity Analysis, which detects unintended model biases related to named entities, and requires no new annotations or corpora. We demonstrate the utility of this analysis by employing it on two different NLP models, a sentiment model and a toxicity model, applied to English-language online comments from four different genres.
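The core idea can be illustrated with a minimal sketch: substitute different names into the same sentence, score each variant, and measure how much the score moves. Everything named here is an assumption made for illustration: `toy_sentiment` is a hypothetical, deliberately biased stand-in for a real model, and the range and standard deviation are generic spread measures, not necessarily the metrics defined in the paper.

```python
from statistics import pstdev

# Hypothetical toy scorer standing in for a real sentiment model.
# It is deliberately biased: the token "katy" carries a spurious negative weight.
TOY_WEIGHTS = {"hate": -0.8, "love": 0.8, "katy": -0.1}


def toy_sentiment(text):
    """Sum per-token weights; unknown tokens contribute 0."""
    return sum(TOY_WEIGHTS.get(tok, 0.0) for tok in text.lower().split())


def perturb(template, names):
    """Create one sentence per name by filling the {name} slot."""
    return [template.format(name=name) for name in names]


def sensitivity(template, names, score_fn):
    """Score every perturbed sentence and summarize the spread of scores."""
    scores = [score_fn(sentence) for sentence in perturb(template, names)]
    return {
        "scores": dict(zip(names, scores)),
        "range": max(scores) - min(scores),  # 0 for a name-invariant model
        "std": pstdev(scores),               # population standard deviation
    }


names = ["Katy Perry", "Taylor Swift"]
report = sensitivity("I hate {name}", names, toy_sentiment)
print(report)
```

A model whose scores do not depend on the name would yield a range and standard deviation of zero for every template; any nonzero spread signals that the name itself is influencing the score. With a real model, `toy_sentiment` would be replaced by that model's scoring call, and the templates would come from existing text with the named entity swapped out, so no new annotations are needed.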
