Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection

Abstract

The perceived toxicity of language can vary based on someone's identity and beliefs, but this variation is often ignored when collecting toxic language datasets, resulting in dataset and model biases. We seek to understand the who, why, and what behind biases in toxicity annotations. In two online studies with demographically and politically diverse participants, we investigate the effect of annotator identities (who) and beliefs (why), drawing from social psychology research about hate speech, free speech, racist beliefs, political leaning, and more. We disentangle what is annotated as toxic by considering posts with three characteristics: anti-Black language, African American English (AAE) dialect, and vulgarity. Our results show strong associations between annotator identity and beliefs and their ratings of toxicity. Notably, more conservative annotators and those who scored highly on our scale for racist beliefs were less likely to rate anti-Black language as toxic, but more likely to rate AAE as toxic. We additionally present a case study illustrating how a popular toxicity detection system's ratings inherently reflect only specific beliefs and perspectives. Our findings call for contextualizing toxicity labels in social variables, which raises immense implications for toxic language annotation and detection.
