Attentive Sequence-to-Sequence Learning for Diacritic Restoration of Yorùbá Language Text

  • 2018-10-30 03:43:54
  • Iroro Orife
Yorùbá is a widely spoken West African language with a writing system rich in tonal and orthographic diacritics. With very few exceptions, diacritics are omitted from electronic texts due to limited device and application support. Diacritics provide morphological information, are crucial for lexical disambiguation and pronunciation, and are vital for any Yorùbá text-to-speech (TTS), automatic speech recognition (ASR) or natural language processing (NLP) task. Reframing Automatic Diacritic Restoration (ADR) as a machine translation task, we experiment with two different attentive Sequence-to-Sequence neural models to process undiacritized text. On our evaluation dataset, this approach produces diacritization error rates of less than 5%. We have released pre-trained models, datasets and source code as an open-source project to advance efforts on Yorùbá language technology.
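
To illustrate the machine translation framing, the sketch below shows one plausible way to build a (source, target) training pair: the target is the diacritized sentence and the source is the same sentence with its diacritics removed. This is an assumption for illustration only and is not taken from the released code; the function names `strip_diacritics` and `make_pair` are hypothetical.

```python
import unicodedata
from typing import Tuple


def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone marks and under-dots) from text.

    Assumption: undiacritized input can be simulated by Unicode
    decomposition followed by dropping every combining character.
    """
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a nonzero class for combining marks.
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", base_only)


def make_pair(diacritized: str) -> Tuple[str, str]:
    """Return a hypothetical (source, target) pair for a seq2seq model."""
    target = unicodedata.normalize("NFC", diacritized)
    source = strip_diacritics(target)
    return source, target


if __name__ == "__main__":
    print(make_pair("Yorùbá"))
```

Under this framing, a model would be trained to map the stripped source back to the diacritized target, in the same way a translation model maps one language to another.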


