Multilingual and Unsupervised Subword Modeling for Zero-Resource Languages

Abstract

Unsupervised subword modeling aims to learn low-level representations of speech audio in "zero-resource" settings: that is, without using transcriptions or other resources from the target language (such as text corpora or pronunciation dictionaries). A good representation should capture phonetic content and abstract away from other types of variability, such as speaker differences and channel noise. Previous work in this area has primarily focused on learning from target language data only, and has been evaluated only intrinsically. Here we directly compare multiple methods, including some that use only target language speech data and some that use transcribed speech from other (non-target) languages, and we evaluate using two intrinsic measures as well as on a downstream unsupervised word segmentation and clustering task. We find that combining two existing target-language-only methods yields better features than either method alone. Nevertheless, even better results are obtained by extracting target language bottleneck features using a model trained on other languages. Cross-lingual training using just one other language is enough to provide this benefit, but multilingual training helps even more. In addition to these results, which hold across both intrinsic measures and the extrinsic task, we discuss the qualitative differences between the different types of learned features.
