End-to-end Text-to-speech for Low-resource Languages by Cross-Lingual Transfer Learning

  • 2019-07-02 07:43:40
  • Tao Tu, Yuan-Jui Chen, Cheng-chieh Yeh, Hung-yi Lee


End-to-end text-to-speech (TTS) has shown great success on large quantities of paired text plus speech data. However, laborious data collection remains difficult for at least 95% of the languages over the world, which hinders the development of TTS in different languages. In this paper, we aim to build TTS systems for such low-resource (target) languages where only very limited paired data are available. We show such TTS can be effectively constructed by transferring knowledge from a high-resource (source) language. Since the model trained on the source language cannot be directly applied to the target language due to input space mismatch, we propose a method to learn a mapping between source and target linguistic symbols. Benefiting from this learned mapping, pronunciation information can be preserved throughout the transferring procedure. Preliminary experiments show that we only need around 15 minutes of paired data to obtain a relatively good TTS system. Furthermore, analytic studies demonstrate that the automatically discovered mapping correlates well with phonetic expertise.



Yuan-Jui Chen*1, Tao Tu*1, Cheng-chieh Yeh1, Hung-yi Lee1 (*Equal contribution)

1College of Electrical Engineering and Computer Science, National Taiwan University

{r07922070, r07922022, r06942067, hungyilee}@ntu.edu.tw

Index Terms: end-to-end, speech synthesis, transfer learning, cross-lingual, low-resource

1 Introduction

Recent research on end-to-end text-to-speech (TTS) [1, 2, 3, 4, 5, 6] has succeeded in generating human-like, high-quality speech. End-to-end TTS systems have also demonstrated a powerful capability for cloning prosody style or speaker characteristics [7, 8, 9, 10, 11]. However, training end-to-end TTS systems requires large quantities of text-audio paired data. To improve data efficiency, a semi-supervised training framework has been proposed for Tacotron [1] that leverages large-scale non-parallel text and speech resources [12]. Nevertheless, there has been little discussion of end-to-end TTS for low-resource languages, where only very limited paired data are available.

Previous research on multi-lingual multi-speaker (MLMS) statistical parametric speech synthesis (SPSS) has discussed using high-resource languages to help construct TTS systems for low-resource languages. Some research shows that a model trained on multiple languages can benefit from cross-lingual information and can be adapted to new languages using only a small amount of data [13, 14]. In these methods, the linguistic inputs of each language are converted internally into language-independent representations. In contrast, in another work [15], inputs are mapped to the International Phonetic Alphabet (IPA) [16], which is a unified canonical representation. The authors propose a language-agnostic model and also show that a model trained on many languages is sometimes better than a monolingual system built from small amounts of data. Likewise, another work indicates that the training data needed to build a new TTS system can be reduced by pooling phonologically close languages, where a special phoneme inventory is presented for sharing as many regularities across languages as possible [17]. Although previous work demonstrates that utilizing cross-lingual information is beneficial to TTS, this idea has not yet been widely studied for end-to-end TTS.

In this paper, we introduce cross-lingual transfer learning for low-resource languages to end-to-end TTS. We first pretrain a TTS model on data from a high-resource (source) language, and then adapt it to low-resource (target) languages. To tackle the input space mismatch across languages, we propose a Phonetic Transformation Network (PTN) model to discover a mapping between source and target linguistic symbols according to their pronunciation. The idea is similar to the probabilistic phoneme mapping model [18, 19], but our approach is purely based on deep learning, and we use the connectionist temporal classification (CTC) loss [20] as the training objective. Benefiting from the learned mapping, pronunciation information can be preserved throughout the transferring procedure. Objective and subjective tests show that only a small amount of paired data in the target language is required for our transfer learning approach to generate intelligible speech (sound demos can be found at https://henryhenrychen.github.io/CL-transfer-demo). In the scenario where the input linguistic symbols of the source and target languages are both phonemes, our approach is competitive with the transfer learning method that uses a handcrafted mapping based on IPA. Furthermore, even when lexicons of the target languages are not accessible, our symbol mapping is still applicable and enables TTS to transfer from a source language with phonemes as input to target languages with characters as input. Finally, analytic studies demonstrate that the automatically discovered mapping correlates well with phonetic expertise.

2 Proposed approach

Given an input symbol sequence, an end-to-end TTS system first transforms each symbol into a symbol embedding by an embedding matrix, and then, according to the symbol embeddings, a generative model (for example, a sequence-to-sequence model as in Tacotron [1]) outputs the spectrogram or raw waveform. We can formulate end-to-end text-to-speech as

$$f_{\theta, W}: \mathcal{X} \rightarrow \mathcal{Y} \tag{1}$$

where θ denotes the parameters of the generative model, W denotes learnable symbol embeddings, and 𝒴 denotes the space of human speech. 𝒳 is the text space for a specific language,

$$\mathcal{X} = \left\{ \{s_t\}_{t=1}^{T} \;\middle|\; \forall t,\ s_t \in \mathcal{S},\ T \in \mathbb{N} \right\} \tag{2}$$

where $\mathcal{S}$ is the linguistic symbol set for this language, and $T$ is the length of the input symbol sequence. Our goal is to construct TTS systems for low-resource (target) languages by transferring knowledge from a high-resource (source) language. We can directly use $\theta_{src}$ learned from the source language to initialize the training of $\theta_{tgt}$ on the target language, because both $\theta_{src}$ and $\theta_{tgt}$ take embeddings as input and generate speech (here src and tgt stand for source and target, respectively). However, the same idea cannot be directly applied to $W_{src}$ and $W_{tgt}$. An obvious problem is that $s_{src}$ and $s_{tgt}$ come from different symbol sets, i.e., $\mathcal{S}_{src} \neq \mathcal{S}_{tgt}$. To deal with this input space mismatch during the transferring procedure, we present two naive baselines and propose a novel transfer learning approach which utilizes a learned mapping between $s_{src}$ and $s_{tgt}$.

Figure 1: Approaches to transfer TTS model from source language to target language. (a) separate symbol space, (b) unified symbol space, and (c) learned symbol space. (d) the training scheme of phonetic transformation network (PTN) for obtaining the learned symbol space.

2.1 Separate symbol space

The first approach simply treats the linguistic symbol sets of the source and target languages as two different symbol sets. In this approach, $\theta_{tgt}$ is derived by finetuning $\theta_{src}$, but the target symbol embeddings $W_{tgt}$ are learned from scratch.

2.2 Unified symbol space

However, some sound units are shared across languages. If we discard $W_{src}$ and train a new $W_{tgt}$, some useful pronunciation information learned previously may be lost. This can be resolved by mapping $\mathcal{S}_{src}$ and $\mathcal{S}_{tgt}$ to a unified symbol set $\mathcal{S}_{uni}$, where the mapping is handcrafted and relies on linguistic expertise. In this way, we can use $W_{src}$ to initialize $W_{tgt}$ because they share the same set of input symbols $\mathcal{S}_{uni}$. Note that this method requires experts to design the symbol mapping for the source and the target language. This kind of mapping is not always available, especially when the symbols of one language are phonemes while those of the other are characters.

2.3 Learned symbol space

To preserve pronunciation information during transferring while not using linguistic expertise, we propose Phonetic Transformation Network (PTN), a model that can automatically learn how to map source symbols to target symbols according to their sounds.

2.3.1 Phonetic transformation network

First, we pretrain an automatic speech recognition (ASR) system on the source language, as illustrated in stage 1 of Figure 1(d). The ASR system learns to output the symbol (phoneme) sequence of the source language using CTC loss. Afterward, we fix the pretrained ASR system and concatenate our proposed PTN model with it. PTN can be formulated as

$$h: \mathbf{p}_{src} \rightarrow \mathbf{p}_{tgt} \tag{3}$$

where $\mathbf{p}_{src}$ and $\mathbf{p}_{tgt}$ are probability distributions over $\mathcal{S}_{src}$ and $\mathcal{S}_{tgt}$ for a specific timestep. In our case, $\mathbf{p}_{src}$ is also the ASR output symbol posteriorgram. The concatenation of the pretrained ASR system and PTN is then further trained on the target language data by maximizing the log-likelihood of the target symbol labellings (phonemes or characters of the target language) using CTC loss, as illustrated in stage 2 of Figure 1(d). In stage 2, the parameters of the ASR system are fixed, so PTN learns to find the most probable target symbols given the ASR output, which consists of source symbols. Since the pretrained ASR system is capable of transcribing an audio frame in the target language into a posteriorgram of source symbols, the training in stage 2 enables PTN to learn a strategy for converting the symbols (phonemes) of the source language into the symbols (phonemes or characters) of the target language.
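For reference, the stage 2 objective can be written in the standard CTC form of [20]. This is a sketch under assumed notation: the target labelling $\mathbf{y}$, the alignment path $\pi$, the collapsing function $\mathcal{B}$ (which removes repeated labels and blanks), and the number of frames $T'$ are not named in the text above, and the output distribution is assumed to include the CTC blank label.

```latex
\mathcal{L}_{\mathrm{CTC}}(h)
  = -\log \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})}
      \prod_{t=1}^{T'} \mathbf{p}_{tgt}^{(t)}[\pi_t],
\qquad
\mathbf{p}_{tgt}^{(t)} = h\!\left(\mathbf{p}_{src}^{(t)}\right)
```

Because the ASR parameters are fixed, only the parameters of $h$ receive gradient updates when this loss is minimized.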

2.3.2 Symbol mapping discovery

With PTN, we can derive the target symbol most similar to a given source symbol according to their sound. Given the $i$-th source symbol $s_{src}^{i}$, we can simply pass a one-hot vector $\mathbf{o}_i$, whose $i$-th dimension is set to 1, to PTN. If the sound of $s_{src}^{i}$ is shared between the source and target languages, PTN will convert $\mathbf{o}_i$ to a target symbol with high probability. Accordingly, we can map each source symbol to a target symbol by the following formulation:

$$\mathrm{map}(s_{src}^{i}) = \begin{cases} s_{tgt}^{j} & \text{if } h_j(\mathbf{o}_i) > \xi,\ j = \arg\max_{k} h_k(\mathbf{o}_i) \\ \text{None} & \text{otherwise} \end{cases} \tag{4}$$

where $h_k(\cdot)$ denotes the $k$-th output dimension of PTN $h(\cdot)$, $s_{tgt}^{j}$ denotes the $j$-th symbol in the target language, and $\xi$ is the transformation threshold. Once the mapping is obtained, we can transfer the embedding weight of a source symbol to its corresponding target symbol. If a target symbol is mapped by several source symbols, we transfer the embedding weight of the one with the highest probability. For symbols in the target language that are not mapped by any source symbol, the embedding weights are still learned from scratch.
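The mapping rule of Eq. (4) and the tie-breaking step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names and the toy probabilities are hypothetical, and `ptn_outputs[i][k]` stands in for $h_k(\mathbf{o}_i)$ as produced by a trained PTN.

```python
from typing import Dict, List, Optional, Sequence


def discover_mapping(
    ptn_outputs: Sequence[Sequence[float]],
    threshold: float,
) -> List[Optional[int]]:
    """Apply Eq. (4): ptn_outputs[i][k] plays the role of h_k(o_i)."""
    mapping: List[Optional[int]] = []
    for probs in ptn_outputs:
        # j = argmax_k h_k(o_i)
        j = max(range(len(probs)), key=lambda k: probs[k])
        # Keep the mapping only if the winning probability exceeds the threshold.
        mapping.append(j if probs[j] > threshold else None)
    return mapping


def resolve_conflicts(
    ptn_outputs: Sequence[Sequence[float]],
    mapping: Sequence[Optional[int]],
) -> Dict[int, int]:
    """For each mapped target symbol, keep the source symbol with the highest probability."""
    best: Dict[int, int] = {}  # target index -> source index
    for i, j in enumerate(mapping):
        if j is None:
            continue
        if j not in best or ptn_outputs[i][j] > ptn_outputs[best[j]][j]:
            best[j] = i
    return best


# Toy example with made-up probabilities: 3 source symbols, 3 target symbols.
toy_outputs = [
    [0.90, 0.05, 0.05],  # source 0 maps to target 0
    [0.50, 0.30, 0.20],  # source 1 also maps to target 0, with lower probability
    [0.34, 0.33, 0.33],  # source 2 stays unmapped (below the threshold)
]
toy_mapping = discover_mapping(toy_outputs, threshold=0.4)
toy_winners = resolve_conflicts(toy_outputs, toy_mapping)
print(toy_mapping)  # [0, 0, None]
print(toy_winners)  # {0: 0}
```

Target symbols that do not appear as keys in `toy_winners` (targets 1 and 2 here) are the ones whose embeddings would be learned from scratch.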

3 Implementation

3.1 TTS model

In this work, we adopt the original Tacotron architecture [1] as our end-to-end TTS model, which has an encoder-decoder architecture with an attention mechanism. The spectral analysis settings are also the same as in [1]. Since our goal is to study transfer learning in the small-data regime, we simply use Griffin-Lim [21] as the waveform synthesizer and leave the exploration of other architectures [3, 8] as future work.

3.2 ASR and PTN model

A pure-CNN model, modified from previous work [22], is adopted for our ASR system. We also experimented with a pyramidal recurrent neural network (RNN) model [23], but found that it did not perform as expected in preliminary studies. We conjecture that an RNN with multiple layers learns a strong language model on the source language, which constrains the model's outputs and hinders the training of the subsequent PTN.

PTN is composed of 3 fully connected layers with the ReLU activation function. Dropout with a rate of 0.4 is applied to each layer.
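A forward pass through such a network can be sketched as follows. This is an illustrative sketch only: the hidden size of 64, the weight initialization, the softmax output, and the choice to apply ReLU and dropout to the two hidden layers but not the output layer are all assumptions, since the text specifies only the layer count, the activation, and the dropout rate. The class and function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def linear(x: List[float], weight: Matrix, bias: List[float]) -> List[float]:
    """Compute weight @ x + bias for a single frame."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def dropout(x: List[float], rate: float, training: bool, rng: random.Random) -> List[float]:
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def softmax(x: List[float]) -> List[float]:
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Tuple[Matrix, List[float]]:
    scale = 1.0 / math.sqrt(n_in)  # assumed initialization scheme
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


class PTN:
    """Three fully connected layers mapping p_src to p_tgt (hidden size is assumed)."""

    def __init__(self, n_src: int, n_tgt: int, hidden: int = 64,
                 rate: float = 0.4, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        sizes = [n_src, hidden, hidden, n_tgt]
        self.layers = [init_layer(a, b, self.rng) for a, b in zip(sizes, sizes[1:])]
        self.rate = rate

    def forward(self, p_src: List[float], training: bool = False) -> List[float]:
        x = list(p_src)
        for weight, bias in self.layers[:-1]:
            x = dropout(relu(linear(x, weight, bias)), self.rate, training, self.rng)
        weight, bias = self.layers[-1]
        return softmax(linear(x, weight, bias))


# Feed a one-hot source symbol (index 1 of 5) through an untrained network.
ptn = PTN(n_src=5, n_tgt=4)
p_tgt = ptn.forward([0.0, 1.0, 0.0, 0.0, 0.0])
print(len(p_tgt), round(sum(p_tgt), 6))
```

In practice the same module would be written in a deep learning framework so that it can be trained with a CTC loss on top of the frozen ASR outputs; the plain-Python version above only shows the shape of the computation.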

4 Experiments

To verify whether a TTS model can benefit from cross-lingual transfer learning and generate clear speech with small amounts of data, both objective and subjective tests are conducted. For the objective tests, we use Google's Cloud Speech-to-Text API to recognize the generated speech and use the character error rate (CER) as the metric for clarity. Additionally, we use mel-cepstral distortion (MCD) [24] for evaluation, which measures the distance between synthesized and ground-truth speech in the space of mel-frequency cepstra (smaller is better). For subjective measurements, mean opinion score (MOS) tests are run to assess naturalness.

For simplicity, "phn2phn" denotes the setting that uses phonemes as input in both the source and target languages, and "phn2char" denotes the setting that uses phoneme input in the source language but character input in the target languages. Likewise, we denote the models that transfer with separate symbol space, unified symbol space and learned symbol space by "Separate", "Unified" and "Learned", respectively.
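Both objective metrics are simple to compute once the recognizer output and aligned cepstral frames are available. The sketch below is an assumption-laden illustration rather than the evaluation code used in the paper: CER is taken as character-level Levenshtein distance divided by the reference length, and MCD uses the commonly cited per-frame definition $(10/\ln 10)\sqrt{2\sum_d (c_d - \hat{c}_d)^2}$, with frame alignment and exclusion of the energy coefficient assumed to be done beforehand. All function names are hypothetical.

```python
import math
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def mcd_frame(c_ref: Sequence[float], c_syn: Sequence[float]) -> float:
    """MCD in dB for one pair of time-aligned mel-cepstral frames."""
    squared = sum((a - b) ** 2 for a, b in zip(c_ref, c_syn))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared)


print(cer("kitten", "sitting"))           # 3 edits / 6 reference characters = 0.5
print(mcd_frame([1.0, 2.0], [1.0, 2.0]))  # identical frames give 0.0
```

An utterance-level MCD would then be the mean of `mcd_frame` over all aligned frame pairs.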

4.1 Data setup

4.1.1 Source language

In our experiments, English was selected as the high-resource language. For pretraining the initial TTS model, we use the LJ Speech Dataset [25], a public domain speech dataset consisting of around 24 hours of text-speech paired data. For the ASR training in Section 2.3, we use the LibriSpeech Dataset [26], an ASR corpus based on public domain audio books. The 100-hour clean training set and the clean development set are used for training and early stopping, respectively.

4.1.2 Target language

Mandarin, German, and French are chosen as the target languages. An internal corpus recorded by a female speaker is used for the Mandarin experiments. The German data comes from the German LibriVox corpus organized by M-AILABS [27]; data from the female speaker Eva K is used. For French, we use the data from the female speaker FR010 in the GlobalPhone collection [28], which consists of only approximately 18 minutes of paired data. We split the data into training and testing sets as shown in Table 1. Mandarin and German use the same test sets for both CER and MCD measurements. However, since very little French data is available and the MCD test needs ground-truth audio, we randomly select the needed training data and leave the rest for testing. This procedure is run three times and the average score is reported.

Table 1: Data statistics of the target languages.
Language   Train (minutes)   Test (utterances)
Mandarin   30                250
German     30                120
French     15                100

4.2 Experimental setup

The initial TTS model is obtained by pretraining on the source language for 10k parameter updates. For all transfer learning methods, we continue training on the paired data of the target language, starting from the same initial TTS model parameters.

In "Separate" (Section 2.1), the embedding matrices for the target languages are randomly initialized from a normal distribution with mean 0 and standard deviation 0.3. In "Unified" (Section 2.2), all symbols are mapped to IPA. Accordingly, for each symbol of the target language, we initialize its embedding weight from the source symbol that shares the same IPA representation. The embeddings for the remaining symbols are randomly initialized as in "Separate". For "Learned" (Section 2.3), the ASR model in stage 1 is pretrained on the source language for 300k parameter updates and the best model is selected on the development set. The training data for PTN is the same as that used for finetuning the TTS model on the target languages. The transformation threshold ξ is set to 0.4 for all target languages. Finally, the embeddings for target symbols are initialized in the same way as in "Unified", except that the mapping is now learned automatically.
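The three initialization schemes differ only in which target symbols receive a copied source embedding. A minimal sketch, assuming embeddings are stored as lists of floats and the mapping is given as a dictionary from target index to source index (both hypothetical representations; the function name is also an assumption):

```python
import random
from typing import Dict, List


def init_target_embeddings(
    source_embeddings: List[List[float]],
    n_target: int,
    tgt_to_src: Dict[int, int],
    std: float = 0.3,
    seed: int = 0,
) -> List[List[float]]:
    """Copy mapped embeddings from the source; draw the rest from N(0, std^2)."""
    rng = random.Random(seed)
    dim = len(source_embeddings[0])
    target: List[List[float]] = []
    for j in range(n_target):
        if j in tgt_to_src:
            # Mapped target symbol: reuse the source symbol's embedding.
            target.append(list(source_embeddings[tgt_to_src[j]]))
        else:
            # Unmapped target symbol: random initialization, as in "Separate".
            target.append([rng.gauss(0.0, std) for _ in range(dim)])
    return target


# Toy source embedding table with 3 symbols of dimension 2 (made-up values).
w_src = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
w_tgt = init_target_embeddings(w_src, n_target=3, tgt_to_src={0: 2, 2: 0})
print(w_tgt[0], w_tgt[2])  # [0.5, 0.6] [0.1, 0.2]
```

Passing an empty dictionary reproduces "Separate"; passing an IPA-derived dictionary corresponds to "Unified"; passing the dictionary produced by the learned mapping corresponds to "Learned".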

4.3 Experiment results

4.3.1 Objective tests

First, we show the results in the "phn2phn" setting, where lexicons for the target languages are accessible. The CER results are shown in Figure 2. For every language and every amount of target data, "Unified" and "Learned" consistently outperform "Separate", which implies that transferring knowledge through the symbol (phoneme) mapping is beneficial. As the amount of target data decreases, "Separate" deteriorates the most, while "Learned" stays close to "Unified". This indicates that the mapping information is especially effective when data is very scarce and that our learned mapping is competitive with the one based on IPA. The MCD results in Figure 3 align with the CER results. We also trained a model from scratch, with all network weights randomly initialized. However, even when all training data is used, it cannot produce understandable speech and yields a CER larger than 80% for every language, so we do not plot its results.

In addition, the "phn2char" setting is investigated. Because the input symbols of the target languages are characters in this setting, the "Unified" approach is not applicable. In Figure 4, a large gap between "Learned" and "Separate" can be observed for German and French. (Because Mandarin characters correspond to syllables rather than phonemes, "phn2char" is not reasonable for Mandarin, so its performance is not presented here.) This shows that our proposed method performs well even when source symbols are phoneme-level and target symbols are character-level.

Figure 2: Results of CER under the "phn2phn" scenario.
Figure 3: Results of MCD tests under the "phn2phn" scenario.
Figure 4: Results of CER under the "phn2char" scenario.

4.3.2 Subjective tests

Table 2: Mean Opinion Score (MOS) ratings with 95% confidence intervals for naturalness.
Method         25 minutes     15 minutes
Ground Truth   4.89±0.045 (single value for both columns)
Separate       3.94±0.085     2.90±0.176
Unified        4.01±0.085     3.48±0.119
Learned        3.99±0.086     3.46±0.117
Scratch        1.39±0.153     1.26±0.094

To further examine the impact of target data size on the quality of generated speech, we conduct a series of MOS tests. We use 25 minutes and 15 minutes of Mandarin paired data for this test under the "phn2phn" setting. The model trained from scratch (denoted by "Scratch") is also measured for comparison. In the MOS tests, 40 subjects were asked to rate the naturalness of the given speech audio, and 80 audio clips of unseen utterances were used for testing. Each utterance received at least 5 ratings. After listening to each stimulus with headphones, the subjects rated its naturalness on a five-point Likert scale (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent).

From Table 2, we can observe that when 25 minutes of paired data is used, the three transfer learning methods "Separate", "Unified" and "Learned" perform almost the same and all of them outperform "Scratch". When the training data is reduced to 15 minutes, "Separate" degrades noticeably, which is consistent with the findings of the objective tests. The results show that, given a small but still sufficient amount of paired data (25 min) in the target language, all three transfer learning approaches benefit from pretraining and generate intelligible speech. When less paired data is available (15 min), our proposed approach "Learned" is comparable to "Unified" and gives promising results without using background linguistic expertise.
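A MOS with a 95% confidence interval, in the format reported in Table 2, can be computed from a list of ratings as sketched below. The paper does not state how its intervals were derived, so the normal approximation (mean ± 1.96 standard errors) used here is an assumption, and the ratings are made-up toy values.

```python
import math
import statistics
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean, half-width) of a normal-approximation confidence interval."""
    mean = statistics.mean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * std_err


# Toy ratings on the five-point scale (hypothetical data).
toy_ratings = [4, 5, 4, 3, 4, 5, 4, 4]
mean, half_width = mos_with_ci(toy_ratings)
print(f"{mean:.2f}±{half_width:.3f}")
```

With only a handful of ratings, a Student's t quantile would give a wider and more appropriate interval than the fixed 1.96 used in this sketch.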

4.4 Symbol mapping studies

Table 3: Precision and recall of the found mapping on 15-minute target data.
Mapping   Precision   Recall   Recall (random)
EN→DE     82.6%       63.3%    3.4%
EN→FR     73.7%       56.0%    4.0%
EN→ZH     64.7%       47.8%    4.5%

In this part, we show that our learned symbol mappings are reasonable by evaluating them against IPA under the "phn2phn" setting. If a source phoneme and its learned corresponding target phoneme share the same IPA representation, we regard this learned mapping as correct. To calculate the recall score, we derive the total set of correct mappings from the overlap of the two languages' phoneme sets after they are mapped to IPA. For comparison, we also show the recall score obtained when each source phoneme is randomly mapped to a target phoneme in the overlap. From Table 3, we can observe that our method retrieves highly informative mappings and is far better than random guessing. Our method also performs better on German and French than on Mandarin, which may result from their greater similarity to the source language, English. Despite the relatively low recall score on Mandarin, our method still discovers some mappings between similar-sounding phonemes that have different IPA representations. For example, symbol <ʂ> and symbol <ʃ> are mapped. Although they are not identical according to IPA, their pronunciations are quite alike and similar to "sh" in English. For more details about the learned mapping, please refer to the demo page.
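The evaluation described above can be sketched as follows. This reflects one reading of the text, stated here as an assumption: precision divides the number of correct learned mappings by the number of learned mappings, and recall divides it by the number of IPA symbols shared by the two phoneme sets. The symbol names and IPA labels in the toy example are placeholders, not real phoneme inventories.

```python
from typing import Dict, Tuple


def precision_recall(
    learned: Dict[str, str],
    src_to_ipa: Dict[str, str],
    tgt_to_ipa: Dict[str, str],
) -> Tuple[float, float]:
    """Score a learned source-to-target mapping against IPA agreement."""
    # A learned pair is correct when both symbols share the same IPA representation.
    correct = sum(1 for s, t in learned.items() if src_to_ipa[s] == tgt_to_ipa[t])
    # The achievable correct mappings are the IPA symbols present in both languages.
    overlap = set(src_to_ipa.values()) & set(tgt_to_ipa.values())
    precision = correct / len(learned) if learned else 0.0
    recall = correct / len(overlap) if overlap else 0.0
    return precision, recall


# Placeholder inventories: two of three IPA labels are shared.
toy_src_to_ipa = {"S1": "a", "S2": "b", "S3": "c"}
toy_tgt_to_ipa = {"T1": "a", "T2": "b", "T3": "d"}
toy_learned = {"S1": "T1", "S3": "T3"}  # S1->T1 is correct, S3->T3 is not

p, r = precision_recall(toy_learned, toy_src_to_ipa, toy_tgt_to_ipa)
print(p, r)  # 0.5 0.5
```

Under this reading, a mapping between two similar-sounding phonemes with different IPA symbols counts against precision even though it may still help the transfer.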

5 Conclusion

In this paper, we explored cross-lingual transfer learning in end-to-end TTS for low-resource languages. We proposed an approach that discovers a cross-lingual symbol mapping to help the model better transfer the knowledge learned previously from abundant source data. Experimental results show that our method enables the model to produce far more natural-sounding speech than a model trained only on target data, and achieves promising results compared with the method that relies on strong linguistic expertise.


  • [1] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to-end speech synthesis,” Interspeech, pp. 4006–4010, 2017.
  • [2] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio.”
  • [3] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
  • [4] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep voice 3: Scaling text-to-speech with convolutional sequence learning,” arXiv preprint arXiv:1710.07654, 2017.
  • [5] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “Voiceloop: Voice fitting and synthesis via a phonological loop,” arXiv preprint arXiv:1707.06588, 2017.
  • [6] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, “Char2wav: End-to-end speech synthesis,” 2017.
  • [7] R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, “Towards end-to-end prosody transfer for expressive speech synthesis with tacotron,” arXiv preprint arXiv:1803.09047, 2018.
  • [8] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” arXiv preprint arXiv:1803.09017, 2018.
  • [9] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. L. Moreno, Y. Wu et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Advances in Neural Information Processing Systems, 2018, pp. 4480–4490.
  • [10] J. Lorenzo-Trueba, F. Fang, X. Wang, I. Echizen, J. Yamagishi, and T. Kinnunen, “Can we steal your vocal identity from the internet?: Initial investigation of cloning obama’s voice using gan, wavenet and low-quality found data,” arXiv preprint arXiv:1803.00860, 2018.
  • [11] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning with a few samples,” in Advances in Neural Information Processing Systems, 2018, pp. 10019–10029.
  • [12] Y.-A. Chung, Y. Wang, W.-N. Hsu, Y. Zhang, and R. Skerry-Ryan, “Semi-supervised training for improving data efficiency in end-to-end speech synthesis,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6940–6944.
  • [13] Q. Yu, P. Liu, Z. Wu, S. K. Ang, H. Meng, and L. Cai, “Learning cross-lingual information with multilingual blstm for speech synthesis of low-resource languages,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5545–5549.
  • [14] A. Gutkin, “Uniform multilingual multi-speaker acoustic model for statistical parametric speech synthesis of low-resourced languages,” 2017.
  • [15] B. Li and H. Zen, “Multi-language multi-speaker acoustic modeling for lstm-rnn based statistical parametric speech synthesis,” 2016.
  • [16] I. P. Association, I. P. A. Staff et al., Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press, 1999.
  • [17] I. Demirsahin, M. Jansche, and A. Gutkin, “A unified phonological representation of south asian languages for multilingual text-to-speech,” 2018.
  • [18] K. C. Sim and H. Li, “Context-sensitive probabilistic phone mapping model for cross-lingual speech recognition,” in Ninth Annual Conference of the International Speech Communication Association, 2008.
  • [19] K. C. Sim, “Discriminative product-of-expert acoustic mapping for cross-lingual phone recognition,” in 2009 IEEE Workshop on Automatic Speech Recognition & Understanding. IEEE, 2009, pp. 546–551.
  • [20] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.
  • [21] D. Griffin and J. Lim, “Signal estimation from modified short-time fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
  • [22] K. Krishna, L. Lu, K. Gimpel, and K. Livescu, “A study of all-convolutional encoders for connectionist temporal classification,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5814–5818.
  • [23] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
  • [24] J. Kominek, T. Schultz, and A. W. Black, “Synthesizer voice quality of new languages calibrated with mean mel cepstral distortion,” in Spoken Languages Technologies for Under-Resourced Languages, 2008.
  • [25] K. Ito et al., “The lj speech dataset,” 2017.
  • [26] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
  • [27] M. A. I. Laboratories, “The m-ailabs speech dataset,” https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/, 2019.
  • [28] T. Schultz, “Globalphone: a multilingual speech and text database developed at karlsruhe university,” in Seventh International Conference on Spoken Language Processing, 2002.