Learning pronunciation from a foreign language in speech synthesis networks

Abstract

Although there are more than 6,500 languages in the world, the pronunciations of many phonemes sound similar across languages. When people learn a foreign language, their pronunciation often reflects the characteristics of their native language. This motivates us to investigate how a speech synthesis network learns pronunciation from datasets in different languages. In this study, we analyze multilingual speech synthesis networks and take advantage of them. First, we train a speech synthesis network bilingually in English and Korean and analyze how the network learns the relations of phoneme pronunciation between the two languages. Our experimental results show that the learned phoneme embedding vectors are located closer together if their pronunciations are similar across the languages. Consequently, the trained networks can synthesize Korean speech in the voices of English speakers and vice versa. Using this result, we propose a training framework that utilizes information from a different language. Specifically, we pre-train a speech synthesis network using datasets from both a high-resource language and a low-resource language, then fine-tune the network using only the low-resource language dataset. Finally, we conduct further simulations on 10 different languages to show that the framework generally extends to other languages.
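
The embedding analysis described above can be illustrated with a minimal sketch. It assumes cosine similarity as the closeness measure, which the abstract does not specify, and it uses invented placeholder vectors rather than embeddings from a trained network. The names `cosine_similarity`, `nearest_cross_lingual`, and `toy_embeddings` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_cross_lingual(embeddings, source_lang, target_lang):
    """Map each source-language phoneme to its most similar target-language phoneme.

    `embeddings` maps (language, phoneme) -> embedding vector.
    Assumes the target language has at least one phoneme.
    """
    result = {}
    for (lang, phoneme), vec in embeddings.items():
        if lang != source_lang:
            continue
        # Score every target-language phoneme against this source phoneme.
        candidates = [
            (cosine_similarity(vec, other_vec), other_phoneme)
            for (other_lang, other_phoneme), other_vec in embeddings.items()
            if other_lang == target_lang
        ]
        result[phoneme] = max(candidates)[1]
    return result


# Placeholder vectors invented for illustration only (NOT results from the paper).
toy_embeddings = {
    ("en", "m"): [0.9, 0.1, 0.0],
    ("en", "s"): [0.0, 0.8, 0.3],
    ("ko", "m"): [0.8, 0.2, 0.1],
    ("ko", "s"): [0.1, 0.9, 0.2],
}

matches = nearest_cross_lingual(toy_embeddings, "en", "ko")
```

With real learned embeddings in place of `toy_embeddings`, a mapping like `matches` would show which Korean phoneme sits closest to each English phoneme in the embedding space.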
