CATT: Character-based Arabic Tashkeel Transformer

  • 2024-07-04 18:06:33
  • Faris Alasmary, Orjuwan Zaafarani, Ahmad Ghannam
Tashkeel, or Arabic Text Diacritization (ATD), greatly enhances the comprehension of Arabic text by removing ambiguity and minimizing the risk of misinterpretations caused by its absence. It plays a crucial role in improving Arabic text processing, particularly in applications such as text-to-speech and machine translation. This paper introduces a new approach to training ATD models. First, we fine-tuned two transformers, encoder-only and encoder-decoder, that were initialized from a pretrained character-based BERT. Then, we applied the Noisy-Student approach to boost the performance of the best model. We evaluated our models alongside 11 commercial and open-source models using two manually labeled benchmark datasets: WikiNews and our CATT dataset. Our findings show that our top model surpasses all evaluated models by relative Diacritic Error Rates (DERs) of 30.83% and 35.21% on WikiNews and CATT, respectively, achieving state-of-the-art results in ATD. In addition, we show that our model outperforms GPT-4-turbo on the CATT dataset by a relative DER of 9.36%. We open-source our CATT models and benchmark dataset for the research community.
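As a rough illustration of the metric reported above, the following Python sketch computes a simple character-level DER and a relative DER improvement. It is a minimal sketch under stated assumptions, not the paper's evaluation script: the function names (`split_diacritics`, `der`, `relative_reduction`) are hypothetical, the diacritic set is limited to U+064B through U+0652, every base character (including spaces) is counted, and "relative DER" is read as the relative reduction with respect to a baseline. Published DER variants differ in which characters they count and whether case endings are included.

```python
# Assumed diacritic set: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Split text into base characters and the diacritics attached to each."""
    bases, marks = [], []
    for ch in text:
        if ch in DIACRITICS:
            if bases:  # ignore a stray mark with no preceding base character
                marks[-1] += ch
        else:
            bases.append(ch)
            marks.append("")
    return bases, marks


def der(reference, hypothesis):
    """Fraction of base characters whose diacritics differ from the reference."""
    ref_bases, ref_marks = split_diacritics(reference)
    hyp_bases, hyp_marks = split_diacritics(hypothesis)
    if ref_bases != hyp_bases:
        raise ValueError("reference and hypothesis differ in undiacritized text")
    if not ref_bases:
        return 0.0
    # Compare marks order-insensitively (e.g. shadda+fatha vs fatha+shadda).
    errors = sum(sorted(r) != sorted(h) for r, h in zip(ref_marks, hyp_marks))
    return errors / len(ref_bases)


def relative_reduction(baseline_der, model_der):
    """Relative DER improvement of a model over a baseline (assumed definition)."""
    return (baseline_der - model_der) / baseline_der


# Hypothetical example: one of three letters carries the wrong vowel.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E"   # kaf, ta, ba, each with fatha
hypothesis = "\u0643\u064E\u062A\u0650\u0628\u064E"  # ta carries kasra instead
example_der = der(reference, hypothesis)
```

Under this reading, a model with a DER of 0.07 against a baseline DER of 0.10 (illustrative numbers, not taken from the paper) would show a relative improvement of about 30%.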


