The IgboAPI Dataset: Empowering Igbo Language Technologies through Multi-dialectal Enrichment

  • 2024-05-02 05:27:35
  • Chris Chinenye Emezue, Ifeoma Okoh, Chinedu Mbonu, Chiamaka Chukwuneke, Daisy Lal, Ignatius Ezeani, Paul Rayson, Ijemma Onwuzulike, Chukwuma Okeke, Gerald Nweya, Bright Ogbonna, Chukwuebuka Oraegbunam, Esther Chidinma Awo-Ndubuisi, Akudo Amarachukwu Osuagwu, Obioha Nmezi

Abstract

The Igbo language is facing a risk of becoming endangered, as indicated by a 2025 UNESCO study. This highlights the need to develop language technologies for Igbo to foster communication, learning and preservation. To create robust, impactful, and widely adopted language technologies for Igbo, it is essential to incorporate the multi-dialectal nature of the language. The primary obstacle in achieving dialectal-aware language technologies is the lack of comprehensive dialectal datasets. In response, we present the IgboAPI dataset, a multi-dialectal Igbo-English dictionary dataset, developed with the aim of enhancing the representation of Igbo dialects. Furthermore, we illustrate the practicality of the IgboAPI dataset through two distinct studies: one focusing on Igbo semantic lexicon and the other on machine translation. In the semantic lexicon project, we successfully establish an initial Igbo semantic lexicon for the Igbo semantic tagger, while in the machine translation study, we demonstrate that by finetuning existing machine translation systems using the IgboAPI dataset, we significantly improve their ability to handle dialectal variations in sentences.

 
