Studying the History of the Arabic Language: Language Technology and a Large-Scale Historical Corpus

  • 2018-09-11 13:44:48
  • Yonatan Belinkov, Alexander Magidow, Alberto Barrón-Cedeño, Avi Shmidman, Maxim Romanov

Abstract

Arabic is a widely-spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, and confirm other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development.

 
