CATT: Character-based Arabic Tashkeel Transformer

  • 2024-07-04 18:06:33
  • Faris Alasmary, Orjuwan Zaafarani, Ahmad Ghannam

Abstract

Tashkeel, or Arabic Text Diacritization (ATD), greatly enhances the comprehension of Arabic text by removing ambiguity and minimizing the risk of misinterpretations caused by its absence. It plays a crucial role in improving Arabic text processing, particularly in applications such as text-to-speech and machine translation. This paper introduces a new approach to training ATD models. First, we fine-tuned two transformers, encoder-only and encoder-decoder, that were initialized from a pretrained character-based BERT. Then, we applied the Noisy-Student approach to boost the performance of the best model. We evaluated our models alongside 11 commercial and open-source models using two manually labeled benchmark datasets: WikiNews and our CATT dataset. Our findings show that our top model surpasses all evaluated models by relative Diacritic Error Rates (DERs) of 30.83% and 35.21% on WikiNews and CATT, respectively, achieving state-of-the-art results in ATD. In addition, we show that our model outperforms GPT-4-turbo on the CATT dataset by a relative DER of 9.36%. We open-source our CATT models and benchmark dataset for the research community (https://github.com/abjadai/catt).
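
To make the reported metric concrete, the following is a minimal sketch of how a Diacritic Error Rate and a relative improvement between two DERs might be computed. It is not the evaluation code used in the paper: the function names (split_diacritics, diacritic_error_rate, relative_reduction) are hypothetical, every base character (including spaces and undiacritized letters) is counted, and "relative DER" is assumed to mean (baseline - model) / baseline. The paper's actual protocol may differ, for example in how case endings or non-Arabic characters are handled.

```python
# Arabic diacritic code points: tanween, short vowels, shadda, sukun (U+064B..U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Return a list of (base_char, diacritics) pairs for a diacritized string."""
    pairs = []
    for ch in text:
        if ch in DIACRITICS and pairs:
            # Attach the mark to the most recent base character.
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        elif ch not in DIACRITICS:
            pairs.append((ch, ""))
        # A diacritic with no preceding base character is ignored.
    return pairs


def diacritic_error_rate(reference, prediction):
    """Fraction of base characters whose diacritics differ (assumed definition)."""
    ref = split_diacritics(reference)
    hyp = split_diacritics(prediction)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("reference and prediction have different base characters")
    if not ref:
        return 0.0
    # Compare marks as sorted sequences so that mark order (e.g. shadda + vowel) is ignored.
    errors = sum(1 for (_, r), (_, h) in zip(ref, hyp) if sorted(r) != sorted(h))
    return errors / len(ref)


def relative_reduction(baseline_der, model_der):
    """Relative improvement of model_der over baseline_der (assumed interpretation)."""
    return (baseline_der - model_der) / baseline_der
```

For instance, calling diacritic_error_rate on a gold string and a system output that share the same letters gives the share of letters whose diacritics were predicted incorrectly, and relative_reduction then compares two such rates.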

 
