How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages

Abstract

Despite the recent advancements of attention-based deep learning architectures across a majority of Natural Language Processing tasks, their application remains limited in a low-resource setting because of a lack of pre-trained models for such languages. In this study, we make the first attempt to investigate the challenges of adapting these techniques for an extremely low-resource language -- Sumerian cuneiform -- one of the world's oldest written languages, attested from at least the beginning of the 3rd millennium BC. Specifically, we introduce the first cross-lingual information extraction pipeline for Sumerian, which includes part-of-speech tagging, named entity recognition, and machine translation. We further curate InterpretLR, an interpretability toolkit for low-resource NLP, and use it alongside human attributions to make sense of the models. We emphasize human evaluations to gauge all our techniques. Notably, most components of our pipeline can be generalised to any other language to obtain an interpretable execution of the techniques, especially in a low-resource setting. We publicly release all software, model checkpoints, and a novel dataset with domain-specific pre-processing to promote further research.
