Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions

Abstract

How can large language models (LLMs) process and translate endangered languages? Many languages lack a large corpus to train a decent LLM; therefore existing LLMs rarely perform well in unseen, endangered languages. On the contrary, we observe that 2000 endangered languages, though without a large corpus, have a grammar book or a dictionary. We propose LINGOLLM, a training-free approach to enable an LLM to process unseen languages that hardly occur in its pre-training. Our key insight is to demonstrate linguistic knowledge of an unseen language in an LLM's prompt, including a dictionary, a grammar book, and morphologically analyzed input text. We implement LINGOLLM on top of two models, GPT-4 and Mixtral, and evaluate their performance on 5 tasks across 8 endangered or low-resource languages. Our results show that LINGOLLM elevates translation capability from GPT-4's 0 to 10.5 BLEU for 10 language directions. Our findings demonstrate the tremendous value of linguistic knowledge in the age of LLMs for endangered languages. Our data, code, and model generations can be found at https://github.com/LLiLab/llm4endangeredlang.
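
The idea of placing linguistic knowledge in the prompt can be illustrated with a minimal sketch. This is not the authors' implementation: the `Morpheme` class, the `build_prompt` function, the prompt layout, and the toy tokens (`kata`, `mi`) are all hypothetical and do not come from the paper or from any real language. The sketch only shows how a morphological analysis, matching dictionary entries, and a grammar excerpt might be combined into one prompt string for a training-free setup.

```python
from dataclasses import dataclass


@dataclass
class Morpheme:
    form: str   # surface form of the morpheme (hypothetical toy data)
    gloss: str  # gloss or grammatical tag for that morpheme


def build_prompt(sentence, analysis, dictionary, grammar_excerpt,
                 target_language="English"):
    """Assemble a prompt from linguistic resources (illustrative layout only).

    analysis: list of (word, [Morpheme, ...]) pairs from a morphological analyzer.
    dictionary: mapping from morpheme form to a dictionary definition.
    """
    gloss_lines = []
    entry_lines = []
    seen = set()
    for word, morphemes in analysis:
        segmented = "-".join(m.form for m in morphemes)
        glossed = "-".join(m.gloss for m in morphemes)
        gloss_lines.append(f"{word}: {segmented} = {glossed}")
        # Include each dictionary entry once, in order of first appearance.
        for m in morphemes:
            if m.form in dictionary and m.form not in seen:
                seen.add(m.form)
                entry_lines.append(f"{m.form}: {dictionary[m.form]}")

    parts = [
        "Grammar notes:\n" + grammar_excerpt,
        "Dictionary entries:\n" + "\n".join(entry_lines),
        "Morphological analysis:\n" + "\n".join(gloss_lines),
        f"Translate into {target_language}:\n{sentence}",
    ]
    return "\n\n".join(parts)


# Toy input; the tokens are invented placeholders, not real language data.
analysis = [("katami", [Morpheme("kata", "see"), Morpheme("mi", "1SG")])]
dictionary = {"kata": "to see, to look at"}
prompt = build_prompt(
    sentence="katami",
    analysis=analysis,
    dictionary=dictionary,
    grammar_excerpt="Verbs take a person suffix after the stem.",
)
print(prompt)
```

The resulting string would then be sent to an LLM of choice; the model call is left out because it depends on the provider's API.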
