Investigating Language Impact in Bilingual Approaches for Computational Language Documentation

Abstract

For endangered languages, data collection campaigns have to accommodate the challenge that many of them are from oral tradition, and producing transcriptions is costly. Therefore, it is fundamental to translate them into a widely spoken language to ensure interpretability of the recordings. In this paper we investigate how the choice of translation language affects the subsequent documentation work and the potential automatic approaches that will work on top of the produced bilingual corpus. To answer this question, we use the MaSS multilingual speech corpus (Boito et al., 2020) to create 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment. Our results highlight that the choice of language for translation influences the word segmentation performance, and that different lexicons are learned by using different aligned translations. Lastly, this paper proposes a hybrid approach for bilingual word segmentation, combining boundary clues extracted from a non-parametric Bayesian model (Goldwater et al., 2009a) with the attentional word segmentation neural model from Godard et al. (2018). Our results suggest that incorporating these clues into the neural models' input representation increases their translation and alignment quality, especially for challenging language pairs.
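
As a small illustration of where the figure of 56 pairs can come from: one reading consistent with it is that each of the eight languages in MaSS is paired with every other language, with direction mattering (one side treated as the language to document, the other as the translation). The sketch below only checks that arithmetic. The language list and the helper name `directed_pairs` are assumptions made for illustration and are not taken from the paper.

```python
from itertools import permutations

# Assumption: the eight languages covered by the MaSS corpus.
MASS_LANGUAGES = [
    "Basque", "English", "Finnish", "French",
    "Hungarian", "Romanian", "Russian", "Spanish",
]


def directed_pairs(languages):
    """Return every ordered (source, target) pair with source != target."""
    return list(permutations(languages, 2))


pairs = directed_pairs(MASS_LANGUAGES)
print(len(pairs))  # 8 * 7 = 56
```

Under this reading, (French, English) and (English, French) count as two distinct pairs, which is what allows the effect of the translation language to be compared for a fixed documented language.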
