Reducing Sentiment Bias in Language Models via Counterfactual Evaluation

Abstract

Recent improvements in large-scale language models have driven progress on automatic generation of syntactically and semantically consistent text for many real-world applications. Many of these advances leverage the availability of large corpora. While training on such corpora encourages the model to understand long-range dependencies in text, it can also result in the models internalizing the social biases present in the corpora. This paper aims to quantify and reduce biases exhibited by language models. Given a conditioning context (e.g. a writing prompt) and a language model, we analyze if (and how) the sentiment of the generated text is affected by changes in values of sensitive attributes (e.g. country names, occupations, genders, etc.) in the conditioning context, a.k.a. counterfactual evaluation. We quantify these biases by adapting individual and group fairness metrics from the fair machine learning literature. Extensive evaluation on two different corpora (news articles and Wikipedia) shows that state-of-the-art Transformer-based language models exhibit biases learned from data. We propose embedding-similarity and sentiment-similarity regularization methods that improve both individual and group fairness metrics without sacrificing perplexity and semantic similarity, a positive step toward development and deployment of fairer language models for real-world applications.
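
As a rough illustration of the counterfactual evaluation described in the abstract, the Python sketch below fills one prompt template with different sensitive-attribute values, scores the sentiment of the continuations, and compares the results across values. Everything named here is an assumption made for illustration: `toy_generate` and `toy_sentiment` are stand-ins for a real language model and sentiment classifier, the template and occupation values are invented, and the gap in mean sentiment is a simplified proxy, not the individual and group fairness metrics used in the paper.

```python
from statistics import mean
from typing import Callable, Dict, List


def counterfactual_prompts(template: str, values: List[str]) -> Dict[str, str]:
    """Fill the same template with each sensitive-attribute value."""
    return {value: template.format(attr=value) for value in values}


def sentiment_by_attribute(
    template: str,
    values: List[str],
    generate: Callable[[str], List[str]],
    sentiment: Callable[[str], float],
) -> Dict[str, float]:
    """Mean sentiment of the generated continuations for each attribute value."""
    prompts = counterfactual_prompts(template, values)
    return {
        value: mean(sentiment(text) for text in generate(prompt))
        for value, prompt in prompts.items()
    }


def max_sentiment_gap(scores: Dict[str, float]) -> float:
    """Largest difference in mean sentiment between any two attribute values."""
    return max(scores.values()) - min(scores.values())


def toy_generate(prompt: str) -> List[str]:
    # Hypothetical stand-in for sampling from a language model.
    # It is deterministic and deliberately skewed so the gap is visible.
    if "baker" in prompt:
        return [prompt + " and loved the work.", prompt + " and was happy."]
    return [prompt + " and hated the work.", prompt + " and was happy."]


def toy_sentiment(text: str) -> float:
    # Hypothetical word-list scorer returning a value in [0, 1];
    # 0.5 means neutral. A real setup would use a trained classifier.
    positive = {"loved", "happy"}
    negative = {"hated"}
    words = text.lower().replace(".", "").split()
    pos = sum(word in positive for word in words)
    neg = sum(word in negative for word in words)
    return 0.5 + 0.5 * (pos - neg) / max(pos + neg, 1)


scores = sentiment_by_attribute(
    "My friend is a {attr}", ["baker", "doctor"], toy_generate, toy_sentiment
)
print(scores, max_sentiment_gap(scores))
```

A gap near zero would suggest that swapping the attribute value leaves the sentiment of the generated text largely unchanged; a large gap signals the kind of sensitivity the paper sets out to quantify and reduce.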
