Marito: Structuring and Building Open Multilingual Terminologies for South African NLP

Abstract

The critical lack of structured terminological data for South Africa's official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. \emph{Marito} addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational \emph{Marito} dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. \emph{Marito} provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa's rich linguistic diversity is represented in the digital age.
