Offense Detection in Dravidian Languages using Code-Mixing Index based Focal Loss

Abstract

Over the past decade, we have seen exponential growth in online content fueled by social media platforms. Data generation at this scale comes with the caveat of a vast amount of offensive content. The complexity of identifying offensive content is exacerbated by the use of multiple modalities (image, language, etc.), code-mixed language and more. Moreover, even after careful sampling and annotation of offensive content, there will always exist a significant class imbalance between offensive and non-offensive content. In this paper, we introduce a novel Code-Mixing Index (CMI) based focal loss which addresses two challenges for Dravidian language offense detection: (1) code-mixing in languages and (2) class imbalance. We also replace the conventional dot product-based classifier with a cosine-based classifier, which results in a boost in performance. Further, we use multilingual models that help transfer characteristics learnt across languages to work effectively with low-resource languages. It is also important to note that our model handles instances of mixed script (say, usage of Latin and Dravidian-Tamil script) as well. To summarize, our model can handle offensive language detection in a low-resource, class-imbalanced, multilingual and code-mixed setting.
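The abstract does not give the formulation of the loss, so the following is only an illustrative sketch of how the three ingredients could fit together, not the method of the paper. It assumes a commonly used utterance-level definition of CMI computed from per-token language tags, the standard focal loss form -(1 - p_t)^gamma * log(p_t), and a scaled cosine classifier. The function names, the `univ` tag for language-neutral tokens, the `scale` parameter, and in particular the `(1 + cmi / 100)` weighting are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def code_mixing_index(lang_tags, neutral="univ"):
    """Utterance-level CMI in [0, 100) from per-token language tags.

    Tokens tagged `neutral` (a hypothetical tag name) are language-independent.
    """
    n = len(lang_tags)
    counts = Counter(tag for tag in lang_tags if tag != neutral)
    u = n - sum(counts.values())  # number of language-independent tokens
    if n == u:
        return 0.0
    return 100.0 * (1.0 - max(counts.values()) / (n - u))


def cosine_logits(features, class_weights, scale=10.0):
    """Scaled cosine similarity between a feature vector and each class vector."""
    def norm(vec):
        return math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid division by zero

    f_norm = norm(features)
    return [
        scale * sum(f * w for f, w in zip(features, weights)) / (f_norm * norm(weights))
        for weights in class_weights
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cmi_focal_loss(logits, target, cmi, gamma=2.0):
    """Focal loss for one example, up-weighted by its CMI.

    ASSUMPTION: the (1 + cmi / 100) factor is an illustrative weighting,
    not the formulation used in the paper.
    """
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to keep log finite
    focal = -((1.0 - p_t) ** gamma) * math.log(p_t)
    return (1.0 + cmi / 100.0) * focal
```

Under these assumptions, a monolingual utterance has a CMI of 0 and its loss reduces to the plain focal loss, while a heavily code-mixed utterance contributes proportionally more. Unlike a dot product, the cosine classifier ignores the magnitude of the feature and class vectors, so the logits are bounded by `scale`.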
