How Does Language Influence Documentation Workflow? Unsupervised Word Discovery Using Translations in Multiple Languages

Abstract

For language documentation initiatives, transcription is an expensive resource: one minute of audio is estimated to take, on average, an hour and a half of a linguist's work (Austin and Sallabank, 2013). Recently, collecting aligned translations in well-resourced languages has become a popular solution for ensuring posterior interpretability of the recordings (Adda et al. 2016). In this paper we investigate language-related impact in automatic approaches for computational language documentation. We translate the bilingual Mboshi-French parallel corpus (Godard et al. 2017) into four other languages, and we perform bilingual-rooted unsupervised word discovery. Our results hint towards an impact of the well-resourced language on the quality of the output. However, by combining the information learned by different bilingual models, we are only able to marginally increase the quality of the segmentation.
