Abstract
This paper presents a comprehensive examination of the impact of tokenization strategies and vocabulary sizes on the performance of Arabic language models in downstream natural language processing tasks. Our investigation focuses on the effectiveness of four tokenizers across various tasks, including News Classification, Hate Speech Detection, Sentiment Analysis, and Natural Language Inference. Leveraging a diverse set of vocabulary sizes, we scrutinize the interplay between tokenization approaches and model performance. The results reveal that Byte Pair Encoding (BPE) with Farasa outperforms other strategies in multiple tasks, underscoring the significance of morphological analysis in capturing the nuances of the Arabic language. However, challenges arise in sentiment analysis, where dialect-specific segmentation issues impact model efficiency. Computational efficiency analysis demonstrates the stability of BPE with Farasa, suggesting its practical viability. Our study uncovers a limited impact of vocabulary size on model performance when the model size is kept unchanged. This challenges established beliefs about the relationship between vocabulary, model size, and downstream tasks, and emphasizes the need to study model size together with the corresponding vocabulary size in order to generalize across domains and mitigate biases, particularly in dialect-based datasets. The paper's recommendations include refining tokenization strategies to address dialect challenges, enhancing model robustness across diverse linguistic contexts, and expanding datasets to encompass the rich diversity of dialectal Arabic. This work not only advances our understanding of Arabic language models but also lays the foundation for responsible and ethical developments in natural language processing technologies tailored to the intricacies of the Arabic language.