Machine Translation for Ge'ez Language

Abstract

Machine translation (MT) for low-resource languages such as Ge'ez, an ancient language that is no longer the native language of any community, faces challenges such as out-of-vocabulary words, domain mismatches, and a lack of sufficient labeled training data. In this work, we explore various methods to improve Ge'ez MT, including transfer learning from related languages, optimizing shared vocabulary and token segmentation approaches, finetuning large pre-trained models, and using large language models (LLMs) for few-shot translation with fuzzy matches. We develop a multilingual neural machine translation (MNMT) model based on language relatedness, which brings an average performance improvement of about 4 BLEU compared to standard bilingual models. We also attempt to finetune the NLLB-200 model, one of the most advanced translation models available today, but find that it performs poorly with only 4k training samples for Ge'ez. Furthermore, we experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches, which leverages embedding similarity-based retrieval to find context examples from a parallel corpus. We observe that GPT-3.5 achieves a remarkable BLEU score of 9.2 with no initial knowledge of Ge'ez, although this is still lower than the MNMT baseline of 15.2. Our work provides insights into the potential and limitations of different approaches for low-resource and ancient language MT.
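To make the fuzzy-match idea concrete, the following is a minimal Python sketch of similarity-based retrieval of context examples followed by few-shot prompt construction. It is an illustration under stated assumptions, not the pipeline used in this work: the `embed` function is a toy character-trigram stand-in for a real sentence-embedding model, the corpus entries are placeholder strings rather than real Ge'ez data, and the names `retrieve_fuzzy_matches` and `build_prompt`, the prompt wording, and the translation direction are all hypothetical.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a sentence-embedding model (assumption):
    a bag of character n-gram counts."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_fuzzy_matches(query, corpus, k=2):
    """Return the k (source, target) pairs whose source side is most
    similar to the query under the embedding above."""
    query_vec = embed(query)
    ranked = sorted(
        corpus,
        key=lambda pair: cosine(query_vec, embed(pair[0])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, examples, src_lang="Ge'ez", tgt_lang="English"):
    """Assemble a few-shot prompt; the wording and direction are hypothetical."""
    blocks = [f"Translate from {src_lang} to {tgt_lang}."]
    for src, tgt in examples:
        blocks.append(f"{src_lang}: {src}\n{tgt_lang}: {tgt}")
    blocks.append(f"{src_lang}: {query}\n{tgt_lang}:")
    return "\n\n".join(blocks)


# Placeholder parallel corpus (not real Ge'ez data).
corpus = [
    ("alpha beta gamma", "TARGET A"),
    ("delta epsilon", "TARGET B"),
    ("alpha beta delta", "TARGET C"),
]
query = "alpha beta"
matches = retrieve_fuzzy_matches(query, corpus, k=2)
prompt = build_prompt(query, matches)
print(prompt)
```

In a realistic setting the toy `embed` would be replaced by a learned embedding model, and the resulting prompt would be sent to the LLM; the retrieval and prompt-assembly steps would keep the same shape.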
