On Importance of Code-Mixed Embeddings for Hate Speech Identification

Abstract

Code-mixing is the practice of using two or more languages in a single sentence, which often occurs in multilingual communities such as India where people commonly speak multiple languages. Classic NLP tools, trained on monolingual data, face challenges when dealing with code-mixed data. Extracting meaningful information from sentences containing multiple languages becomes difficult, particularly in tasks like hate speech detection, due to linguistic variation, cultural nuances, and data sparsity. To address this, we aim to analyze the significance of code-mixed embeddings and evaluate the performance of BERT and HingBERT models (trained on a Hindi-English corpus) in hate speech detection. Our study demonstrates that HingBERT models, benefiting from training on the extensive Hindi-English dataset L3Cube-HingCorpus, outperform BERT models when tested on hate speech text datasets. We also found that code-mixed Hing-FastText performs better than standard English FastText and vanilla BERT models.
