KBCNMUJAL@HASOC-Dravidian-CodeMix-FIRE2020: Using Machine Learning for Detection of Hate Speech and Offensive Code-Mixed Social Media text

Abstract

This paper describes the system submitted by our team, KBCNMUJAL, for Task 2 of the shared task Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC), at the Forum for Information Retrieval Evaluation, December 16-20, 2020, Hyderabad, India. The HASOC organizers shared datasets for two Dravidian languages, viz. Malayalam and Tamil, of 4000 observations each. These datasets are used to train the machine using different machine learning algorithms, based on classification and regression models. The datasets consist of tweets or YouTube comments with two class labels, offensive and not offensive, and the machine is trained to classify such social media messages into these two categories. Appropriate n-gram feature sets are extracted to learn the specific characteristics of Hate Speech text messages. These feature models are based on TF-IDF weights of n-grams. The referred work and the respective experiments show that features such as word n-grams, character n-grams, and a combined model of word and character n-grams can be used to identify the term patterns of offensive text content. As part of the HASOC shared task, the test datasets were made available by the HASOC track organizers, and the best-performing classification models developed for both languages were applied to them. For Malayalam, the model that gave the highest accuracy on the training dataset was used to predict the categories of the respective test data; this system obtained an F1 score of 0.77. Similarly, the best-performing model for Tamil obtained an F1 score of 0.87. This work received 2nd and 3rd rank in shared Task 2 for Malayalam and Tamil respectively. The proposed system is named HASOC_kbcnmujal.
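The feature models described in the abstract (TF-IDF weights over word n-grams, character n-grams, and their combination) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the n-gram ranges, the smoothed IDF formula, the length-normalised term frequency, and the `w:`/`c:` prefixes used to keep the two feature spaces apart are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Word n-grams (default: unigrams and bigrams; ranges are an assumption)."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def char_ngrams(text, n_min=2, n_max=3):
    """Character n-grams (default: 2- and 3-grams; ranges are an assumption)."""
    s = text.lower()
    return [
        s[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(s) - n + 1)
    ]


def combined_ngrams(text):
    """Union of both feature sets; prefixes keep word and char features distinct."""
    return ["w:" + g for g in word_ngrams(text)] + ["c:" + g for g in char_ngrams(text)]


def tfidf(docs, extractor):
    """Return one {term: weight} dict per document.

    Weight = (term count / total terms in document) * smoothed IDF, where
    smoothed IDF = ln((1 + N) / (1 + df)) + 1. This weighting variant is an
    assumption; the abstract does not state which one was used.
    """
    counts = [Counter(extractor(d)) for d in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each term counted once per document
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            term: (tf / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in c.items()
        })
    return vectors
```

Calling `tfidf(messages, combined_ngrams)` on a list of message strings yields sparse weight dictionaries that could then be passed to any classifier. Terms that occur in fewer messages receive a higher IDF, which is what lets such features pick out vocabulary characteristic of one class.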
