Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages

Abstract

We show that unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model (Downey et al., 2021) multilingually. Further, we show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language. In our experiments, we transfer from a collection of 10 Indigenous American languages (AmericasNLP, Mager et al., 2021) to K'iche', a Mayan language. We compare our model to a monolingual baseline, and show that the multilingual pre-trained approach yields much more consistent segmentation quality across target dataset sizes, including a zero-shot performance of 20.6 F1, and exceeds the monolingual performance in 9/10 experimental settings. These results have promising implications for low-resource NLP pipelines involving human-like linguistic units, such as the sparse transcription framework proposed by Bird (2020).
