Abstract
Arabic handwritten text recognition (HTR) is challenging, especially for historical texts, due to diverse writing styles and the intrinsic features of Arabic script. Additionally, Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models. To address these challenges, we propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model. By leveraging the transformer's attention mechanism, HATFormer captures spatial contextual information to address the intrinsic challenges of Arabic script through differentiating cursive characters, decomposing visual representations, and identifying diacritics. Our customization to historical handwritten Arabic includes an image processor for effective ViT information preprocessing, a text tokenizer for compact Arabic text representation, and a training pipeline that accounts for a limited amount of historic Arabic handwriting data. HATFormer achieves a character error rate (CER) of 8.6% on the largest public historical handwritten Arabic dataset, with a 51% improvement over the best baseline in the literature. HATFormer also attains a comparable CER of 4.2% on the largest private non-historical dataset. Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges, contributing to advancements in document digitization, information retrieval, and cultural preservation.
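
For readers unfamiliar with the metric, the sketch below illustrates how a character error rate is commonly computed: the Levenshtein (edit) distance between the predicted and reference text, summed over a corpus and divided by the total number of reference characters. This is an illustrative assumption about the usual definition, not the evaluation code behind the reported numbers. The function names `edit_distance` and `character_error_rate` are hypothetical, and characters are counted as Unicode code points, so Arabic diacritics count as separate characters here.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution_cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,                      # delete r from ref
                    curr[j - 1] + 1,                  # insert h into ref
                    prev[j - 1] + substitution_cost,  # match or substitute
                )
            )
        prev = curr
    return prev[-1]


def character_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters.

    Assumption: normalization by the summed reference length; other
    conventions (e.g. averaging per-line CER) exist.
    """
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_chars


if __name__ == "__main__":
    # One missing letter in a four-letter reference word.
    print(character_error_rate(["كتاب"], ["كتب"]))
```

Under this definition, a CER of 8.6% corresponds to roughly 8.6 character-level edits per 100 reference characters.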