Learning pronunciation from a foreign language in speech synthesis networks

Abstract

Although there are more than 6,500 languages in the world, the pronunciations of many phonemes sound similar across languages. When people learn a foreign language, their pronunciation often reflects their native language's characteristics. This motivates us to investigate how a speech synthesis network learns pronunciation from datasets in different languages. In this study, we analyze and take advantage of a multilingual speech synthesis network. First, we train the speech synthesis network bilingually in English and Korean and analyze how the network learns the relations of phoneme pronunciation between the languages. Our experimental results show that the learned phoneme embedding vectors are located closer together if their pronunciations are similar across the languages. Consequently, the trained networks can synthesize English speakers' Korean speech and vice versa. Using this result, we propose a training framework that utilizes information from a different language. Specifically, we pre-train a speech synthesis network using datasets from both a high-resource language and a low-resource language, then fine-tune the network using the low-resource language dataset. Finally, we conduct further simulations on 10 different languages to show that the framework generally extends to other languages.
