Sadeed: Advancing Arabic Diacritization Through Small Language Model

  • 2025-08-21 05:56:26
  • Zeina Aldallal, Sara Chrouf, Khalil Hennara, Mohamed Motaism Hamed, Muhammad Hreden, Safwan AlModhayan

Abstract

Arabic text diacritization remains a persistent challenge in natural language processing due to the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language model adapted from Kuwain 1.5B (Hennara et al., 2025), a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets, constructed through a rigorous data-cleaning and normalization pipeline. Despite utilizing modest computational resources, Sadeed achieves competitive results compared to proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.

 
