Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi

Abstract

The widespread presence of offensive language on social media has motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments with state-of-the-art cross-lingual transformers, transferring from existing data in Bengali, English, and Hindi.
