The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages

  • 2025-07-24 19:28:33
  • Abdulhady Abas Abdullah, Amir H. Gandomi, Tarik A Rashid, Seyedali Mirjalili, Laith Abualigah, Milena Živković, Hadi Veisi

Abstract

In natural language processing, multilingual models like mBERT and XLM-RoBERTa promise broad coverage but often struggle with languages that share a script yet differ in orthographic norms and cultural context. This issue is especially notable in Arabic-script languages such as Kurdish Sorani, Arabic, Persian, and Urdu. We introduce the Arabic Script RoBERTa (AS-RoBERTa) family: four RoBERTa-based models, each pre-trained on a large corpus tailored to its specific language. By focusing pre-training on language-specific script features and statistics, our models capture patterns overlooked by general-purpose models. When fine-tuned on classification tasks, AS-RoBERTa variants outperform mBERT and XLM-RoBERTa by 2 to 5 percentage points. An ablation study confirms that script-focused pre-training is central to these gains. Error analysis using confusion matrices shows how shared script traits and domain-specific content affect performance. Our results highlight the value of script-aware specialization for languages using the Arabic script and support further work on pre-training strategies rooted in script and language specificity.

 
