Abstract
Preprocessing is one of the key phases in the text classification pipeline. This study investigates the impact of preprocessing on text classification, specifically on offensive language and hate speech classification for Arabic text. The Arabic used on social media is informal and written in Arabic dialects, which makes the text classification task very complex. Preprocessing helps reduce dimensionality and remove useless content. We apply intensive preprocessing techniques to the dataset before processing it further and feeding it into the classification model. This intensive preprocessing-based approach demonstrates a significant impact on the offensive language detection and hate speech detection shared tasks of the fourth workshop on Open-Source Arabic Corpora and Corpora Processing Tools (OSACT). Our team won third place in Sub-Task A (Offensive Language Detection) and first place in Sub-Task B (Hate Speech Detection), with F1 scores of 89% and 95%, respectively, achieving state-of-the-art performance in terms of F1, accuracy, recall, and precision for Arabic hate speech detection.