Offense Detection in Dravidian Languages using Code-Mixing Index based Focal Loss

Abstract

Over the past decade, we have seen exponential growth in online content fueled by social media platforms. Data generation at this scale comes with the caveat that it contains an overwhelming amount of offensive content. Identifying offensive content is made harder by the use of multiple modalities (image, language, etc.), code-mixed language and more. Moreover, even if we carefully sample and annotate offensive content, there will always be a significant class imbalance between offensive and non-offensive content. In this paper, we introduce a novel Code-Mixing Index (CMI) based focal loss which addresses two challenges in Dravidian language offense detection: (1) code mixing in languages and (2) class imbalance. We also replace the conventional dot product-based classifier with a cosine-based classifier, which results in a boost in performance. Further, we use multilingual models that help transfer characteristics learnt across languages to work effectively with low-resource languages. It is also important to note that our model handles instances of mixed script (for example, Latin and Tamil script used together) as well. Our model can handle offensive language detection in a low-resource, class-imbalanced, multilingual and code-mixed setting.
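As a rough illustration of the ingredients named in the abstract, the sketch below computes a per-utterance Code-Mixing Index using the commonly cited formulation, 100 * (1 - max_i(w_i) / (n - u)), a cosine-based classifier head, and the standard focal loss. The abstract does not state how the CMI enters the loss, so the `cmi_focal_loss` coupling (a multiplicative weight of 1 + CMI/100) is purely a placeholder assumption, as are all function names, the `scale` parameter and the default `gamma` and `alpha` values.

```python
import math

def code_mixing_index(lang_tags):
    """CMI of one utterance: 100 * (1 - max_lang_count / (n - u)).

    lang_tags holds one language tag per token; None marks a
    language-independent token (counted in u, excluded from n - u).
    """
    tagged = [t for t in lang_tags if t is not None]
    if not tagged:  # n == u: no language-tagged tokens, CMI defined as 0
        return 0.0
    counts = {}
    for t in tagged:
        counts[t] = counts.get(t, 0) + 1
    return 100.0 * (1.0 - max(counts.values()) / len(tagged))

def cosine_logits(feature, class_weights, scale=10.0):
    """Cosine-based classifier: scaled cosine similarity between the
    feature vector and each class weight vector (instead of a raw dot product)."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
    f_norm = norm(feature)
    return [
        scale * sum(f * w for f, w in zip(feature, wc)) / (f_norm * norm(wc))
        for wc in class_weights
    ]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Standard focal loss: -alpha * (1 - p_t)^gamma * log(p_t)."""
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

def cmi_focal_loss(probs, target, cmi, gamma=2.0, alpha=1.0):
    """HYPOTHETICAL coupling, not taken from the paper: up-weight the
    focal loss of more heavily code-mixed utterances by (1 + CMI/100)."""
    return (1.0 + cmi / 100.0) * focal_loss(probs, target, gamma, alpha)

# Usage: a four-token utterance, half Tamil and half English.
cmi = code_mixing_index(["ta", "ta", "en", "en"])
probs = softmax(cosine_logits([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]]))
loss = cmi_focal_loss(probs, target=0, cmi=cmi)
```

Because the cosine head normalizes both the feature and the class weights, the logits are bounded by `scale`, which keeps the magnitude of the class weights from dominating the prediction.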
