Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages

Abstract

Most social media users come from the Global South, where harmful content usually appears in local languages. Yet, AI-driven moderation systems struggle with low-resource languages spoken in these regions. Through semi-structured interviews with 22 AI experts working on harmful content detection in four low-resource languages, Tamil (South Asia), Swahili (East Africa), Maghrebi Arabic (North Africa), and Quechua (South America), we examine systemic issues in building automated moderation tools for these languages. Our findings reveal that beyond data scarcity, socio-political factors such as tech companies' monopoly on user data and lack of investment in moderation for low-profit Global South markets exacerbate historic inequities. Even if more data were available, the English-centric and data-intensive design of language models and preprocessing techniques overlooks the need to design for morphologically complex, linguistically diverse, and code-mixed languages. We argue these limitations are not just technical gaps caused by "data scarcity" but reflect structural inequities, rooted in colonial suppression of non-Western languages. We discuss multi-stakeholder approaches to strengthen local research capacity, democratize data access, and support language-aware solutions to improve automated moderation for low-resource languages.
