Counteracts: Testing Stereotypical Representation in Pre-trained Language Models

Abstract

Language models have demonstrated strong performance on various natural language understanding tasks. Like humans, language models can also carry biases learned from their training data. As more and more downstream tasks integrate language models into their pipelines, it is necessary to understand their internal stereotypical representations and the methods available to mitigate the negative effects. In this paper, we propose a simple method that uses counterexamples to test the internal stereotypical representation in pre-trained language models. We focus mainly on gender bias, but the method can be extended to other types of bias. We evaluate models on 9 different cloze-style prompts, each consisting of a knowledge prompt and a base prompt. Our results indicate that pre-trained language models show a certain amount of robustness when given unrelated knowledge, and that they rely on shallow linguistic cues, such as word position and syntactic structure, when altering the internal stereotypical representation. These findings shed light on how to manipulate language models in a neutral way for both finetuning and evaluation.
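As a rough illustration of what a cloze-style prompt built from a knowledge prompt and a base prompt might look like, the sketch below assembles a few variants in plain Python. The example sentences, the `[MASK]` placeholder, and the helper names `build_prompt` and `fill` are assumptions made for illustration; they are not the paper's 9 prompts or its code, and no language model is queried here.

```python
MASK = "[MASK]"  # assumed placeholder for the cloze slot


def build_prompt(base: str, knowledge: str = "", position: str = "before") -> str:
    """Combine an optional knowledge sentence with a base cloze prompt.

    `position` controls whether the knowledge sentence is placed before or
    after the base prompt, so that word-position effects can be compared.
    """
    if MASK not in base:
        raise ValueError("base prompt must contain the mask placeholder")
    if not knowledge:
        return base
    if position == "before":
        return f"{knowledge} {base}"
    if position == "after":
        return f"{base} {knowledge}"
    raise ValueError(f"unknown position: {position!r}")


def fill(prompt: str, token: str) -> str:
    """Substitute a candidate token into the first cloze slot."""
    return prompt.replace(MASK, token, 1)


# Hypothetical sentences, chosen only to show the structure.
base = f"The nurse said that {MASK} was tired."
counterexample = "The nurse is a man."
unrelated = "The sky is blue."

prompts = {
    "base_only": build_prompt(base),
    "counterexample_before": build_prompt(base, counterexample, "before"),
    "counterexample_after": build_prompt(base, counterexample, "after"),
    "unrelated_before": build_prompt(base, unrelated, "before"),
}

for name, prompt in prompts.items():
    print(f"{name}: {prompt}")
```

In an actual test, each prompt would be passed to a masked language model and the scores assigned to candidate tokens such as "he" and "she" compared across variants; that step is omitted here because it depends on the model and tooling used.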
