Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach

Abstract

We present models which complete missing text given transliterations of ancient Mesopotamian documents, originally written on cuneiform clay tablets (2500 BCE - 100 CE). Due to the tablets' deterioration, scholars often rely on contextual cues to manually fill in missing parts of the text in a subjective and time-consuming process. We identify that this challenge can be formulated as a masked language modelling task, used mostly as a pretraining objective for contextualized language models. Following this, we develop several architectures focusing on the Akkadian language, the lingua franca of the time. We find that despite data scarcity (1M tokens) we can achieve state-of-the-art performance on missing-token prediction (89% hit@5) using a greedy decoding scheme and pretraining on data from other languages and different time periods. Finally, we conduct human evaluations showing the applicability of our models in assisting experts to transcribe texts in extinct languages.
