Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages

Abstract

Most social media users come from non-English-speaking countries in the Global South. Despite the widespread prevalence of harmful content in these regions, current moderation systems repeatedly struggle in low-resource languages spoken there. In this work, we examine the challenges AI researchers and practitioners face when building moderation tools for low-resource languages. We conducted semi-structured interviews with 22 AI researchers and practitioners specializing in automatic detection of harmful content in four diverse low-resource languages from the Global South. These are: Tamil from South Asia, Swahili from East Africa, Maghrebi Arabic from North Africa, and Quechua from South America. Our findings reveal that social media companies' restrictions on researchers' access to data exacerbate the historical marginalization of these languages, which have long lacked datasets for studying online harms. Moreover, common preprocessing techniques and language models, predominantly designed for data-rich English, fail to account for the linguistic complexity of low-resource languages. This leads to critical errors when moderating content in Tamil, Swahili, Arabic, and Quechua, which are morphologically richer than English. Based on our findings, we establish that the precarities in current moderation pipelines are rooted in deep systemic inequities and continue to reinforce historical power imbalances. We conclude by discussing multi-stakeholder approaches to improve moderation for low-resource languages.
