That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages

Abstract

Only a handful of the world's languages are abundant with the resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus on gaining a deeper understanding of how general these representations might be, and how individual phones are getting improved in a multilingual setting. To that end, we select a phonetically diverse set of languages, and perform a series of monolingual, multilingual and crosslingual (zero-shot) experiments. The ASR is trained to recognize the International Phonetic Alphabet (IPA) token sequences. We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting, where the model, among other errors, considers Javanese as a tone language. Notably, as little as 10 hours of the target language training data tremendously reduces ASR error rates. Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages - an encouraging result for the low-resource speech community.
