Abusive and Threatening Language Detection in Urdu using Boosting based and BERT based models: A Comparative Approach

Abstract

Online hatred is a growing concern on many social media platforms. To address this issue, platforms have introduced moderation policies for such content and employ moderators who review posts that violate those policies and take appropriate action. Researchers in the abusive language domain also study how to detect such content more reliably. Although there is extensive research on abusive language detection in English, there is a gap in low-resource languages such as Hindi and Urdu. For the FIRE 2021 shared task "HASOC - Abusive and Threatening language detection in Urdu", the organizers proposed an Urdu dataset for abusive language detection along with one for threatening language detection. In this paper, we explore several machine learning models, including XGBoost, LGBM, and m-BERT based models, for abusive and threatening content detection in Urdu on the shared task data. We observe that a Transformer model specifically trained on an abusive language dataset in Arabic gives the best performance. Our model ranked first for both abusive and threatening content detection, with F1 scores of 0.88 and 0.54, respectively.
