Exploiting Cross-Lingual Speaker and Phonetic Diversity for Unsupervised Subword Modeling

Abstract

This research addresses the problem of acoustic modeling for low-resource languages for which transcribed training data is absent. The goal is to learn robust frame-level feature representations that can be used to identify and distinguish subword-level speech units. The proposed feature representations comprise various types of multilingual bottleneck features (BNFs) that are obtained via multi-task learning of deep neural networks (MTL-DNN). One of the key problems is how to acquire high-quality frame labels for untranscribed training data to facilitate supervised DNN training. It is shown that robust BNF representations can be learned by effectively leveraging transcribed speech data and well-trained automatic speech recognition (ASR) systems from one or more out-of-domain (resource-rich) languages. Out-of-domain ASR systems can be applied to perform speaker adaptation with untranscribed training data of the target language, and to decode the training speech into frame-level labels for DNN training. It is also found that better frame labels can be generated by considering temporal dependency in speech when performing frame clustering. The proposed feature learning methods are evaluated on the standard task of unsupervised subword modeling in Track 1 of the ZeroSpeech 2017 Challenge. The best performance achieved by our system is $9.7\%$ in terms of across-speaker triphone minimal-pair ABX error rate, which is comparable to the best systems reported recently. Lastly, our investigation reveals that the closeness between target languages and out-of-domain languages, and the amount of available training data for individual target languages, could have a significant impact on the quality of the learned features.
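
To make the evaluation metric concrete, the following is a minimal sketch of how an ABX error rate can be computed from frame-level features. It is not the challenge's official evaluation code. It assumes each token is a NumPy array of shape (frames, dimensions), that X belongs to the same category as A and a different one from B, and that tokens are compared with dynamic time warping (DTW) over cosine frame distances. The function names and the simple length normalization are illustrative choices; in the across-speaker condition, X would additionally come from a different speaker than A and B.

```python
import numpy as np


def cosine_frame_distances(x, y):
    """Pairwise cosine distance between frames of x (Tx, D) and y (Ty, D)."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - xn @ yn.T


def dtw_distance(x, y):
    """DTW alignment cost between two feature sequences.

    The accumulated cost is divided by (Tx + Ty), a simple normalization
    assumed here for illustration.
    """
    cost = cosine_frame_distances(x, y)
    tx, ty = cost.shape
    acc = np.full((tx + 1, ty + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, tx + 1):
        for j in range(1, ty + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return acc[tx, ty] / (tx + ty)


def abx_error_rate(triples):
    """Fraction of (a, b, x) triples in which x is not closer to a than to b.

    Ties count as half an error.
    """
    errors = 0.0
    count = 0
    for a, b, x in triples:
        d_ax = dtw_distance(a, x)
        d_bx = dtw_distance(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
        count += 1
    return errors / count
```

Under these assumptions, a lower value means that the learned features place tokens of the same subword category closer together than tokens of a contrasting category.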
