Meta Learning for End-to-End Low-Resource Speech Recognition

  • 2019-10-26 16:00:44
  • Jui-Yang Hsu, Yuan-Jui Chen, Hung-yi Lee


In this paper, we proposed to apply a meta learning approach to low-resource automatic speech recognition (ASR). We formulated ASR for different languages as different tasks, and meta-learned the initialization parameters from many pretraining languages to achieve fast adaptation on an unseen target language, via the recently proposed model-agnostic meta learning algorithm (MAML). We evaluated the proposed approach using six languages as pretraining tasks and four languages as target tasks. Preliminary results showed that the proposed method, MetaASR, significantly outperforms the state-of-the-art multitask pretraining approach on all target languages with different combinations of pretraining languages. In addition, owing to MAML's model-agnostic property, this paper also opens a new research direction of applying meta learning to more speech-related applications.


Jui-Yang Hsu  Yuan-Jui Chen  Hung-yi Lee
National Taiwan University
College of Electrical Engineering and Computer Science
{r07921053, r07922070, hungyilee}

Index Terms—  meta-learning, low-resource, speech recognition, language adaptation, IARPA-BABEL

1 Introduction

With the recent advances of deep learning, integrating the main modules of automatic speech recognition (ASR), such as the acoustic model, pronunciation lexicon, and language model, into a single end-to-end model is highly attractive. Connectionist Temporal Classification (CTC) [8] lends itself to such end-to-end approaches: by introducing an additional blank symbol and a specially designed loss function, it is optimized to generate the correct character sequence directly from the speech signal, without framewise phoneme alignment in advance. With many recent results [10, 2, 4], end-to-end deep learning has attracted growing interest in the speech community.

However, building such an end-to-end ASR system requires a huge amount of paired speech-transcription data, which is costly to collect. Most languages in the world lack sufficient paired data for training. The dominant approach under such a low-resource setting is to pretrain on other source languages as the initialization and then fine-tune on the target language, also known as multilingual transfer learning / pretraining (MultiASR) [20, 17]. The backbone of MultiASR is a multitask model with shared hidden layers (the encoder) and many language-specific heads. The model structure is designed so that the encoder learns to extract language-independent representations from many source languages, yielding a better acoustic model. Many recent works [3, 5, 18] have shown that such “language-independent” features improve ASR performance compared to monolingual training.

(a) MultiASR (b) MetaASR
Fig. 1: Illustration: Difference of the learned parameters from MultiASR & MetaASR. The solid lines represent the learning process of pretraining, either multitask or meta learning. The dashed lines represent the language-specific adaptation.
(The figure is modified from [9])

Besides directly training the model on all the source languages, there are several variants of the MultiASR approach. Language-adversarial training approaches [22, 1] introduce a language-adversarial classification objective to the shared encoder, negating the gradients backpropagated from the language classifier to encourage the encoder to extract more language-independent representations. Hierarchical approaches [15] introduce objectives of different granularity by combining character and phoneme prediction at different levels of the model.

In this paper, we provide a novel research direction that follows up on the idea of multilingual pretraining: meta learning. Meta learning, or learning-to-learn, has recently received considerable interest in the machine learning community. The goal of meta learning is to solve the problem of “fast adaptation on unseen data”, which is aligned with our low-resource setting. Following its success in computer vision under the few-shot learning setting [14, 16, 19], there have been some works in language and speech processing, for instance, language transfer in neural machine translation [9], dialogue generation [12], and speaker adaptive training [11], but none on multilingual pretraining for speech recognition.

We use the model-agnostic meta-learning algorithm (MAML) [6] in this work. As its name suggests, MAML can be applied to any network architecture. MAML only modifies the optimization process, following the meta learning training scheme. It neither introduces additional modules, as adversarial training does, nor requires phoneme-level annotation (usually obtained through a lexicon), as hierarchical approaches do. We evaluated the effectiveness of the proposed meta learning algorithm, MetaASR, on the IARPA BABEL dataset [7]. Our experiments reveal that MetaASR outperforms MultiASR significantly across all target languages.

2 Proposed Approach

2.1 Multilingual CTC Model

Fig. 2: Multilingual CTC model architecture

We used the model architecture illustrated in Fig. 2. The shared encoder is parameterized by $\theta$, and the set of language-specific heads is parameterized by $\theta_{h,l}$ (the head for the $l$-th language). Let the dataset be $D$, composed of paired data $(X, C)$. Let $X = x_1, x_2, \dots, x_T$ with length $T$ be the input features and $C = c_1, c_2, \dots, c_L$ with length $L$ be the target labels. $X$ is encoded into a sequence of hidden states by the shared encoder, then fed into the language-specific head of the corresponding language with softmax activation to output the prediction sequence $\hat{C} = \hat{c}_1, \hat{c}_2, \dots, \hat{c}_L$ with length $L$.
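The routing of an utterance through a shared encoder and a language-specific head can be sketched as follows. This is only an illustrative toy, not the model used in the paper: a single tanh layer stands in for the VGG + BLSTM encoder (and does no downsampling), and the names `encode`, `predict`, `heads`, the hidden size, and the per-language label counts are all assumptions.

```python
import math
import random

rng = random.Random(0)

FEAT_DIM, HIDDEN = 83, 8          # 83 = 80 filterbank + 3 pitch; hidden size is a toy value
VOCAB = {"vi": 30, "sw": 28}      # hypothetical label counts per language (incl. blank)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def vec_mat(vec, mat):
    """Row vector (len = rows) times matrix (rows x cols)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def softmax(logits):
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Shared encoder parameters (theta) and one output head per language (theta_{h,l}).
theta = rand_matrix(FEAT_DIM, HIDDEN)
heads = {lang: rand_matrix(HIDDEN, n) for lang, n in VOCAB.items()}

def encode(frames, theta):
    """Toy stand-in for the shared encoder: one hidden vector per input frame."""
    return [[math.tanh(h) for h in vec_mat(x, theta)] for x in frames]

def predict(frames, lang, theta, heads):
    """Encode with the shared parameters, then apply the head of `lang` with softmax."""
    return [softmax(vec_mat(h, heads[lang])) for h in encode(frames, theta)]

frames = [[rng.gauss(0.0, 1.0) for _ in range(FEAT_DIM)] for _ in range(50)]
probs_vi = predict(frames, "vi", theta, heads)
probs_sw = predict(frames, "sw", theta, heads)
print(len(probs_vi), len(probs_vi[0]), len(probs_sw[0]))   # 50 30 28
```

The same `theta` is used for both languages; only the head differs, which is what allows the encoder to be pretrained on source languages and reused for a new one.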

CTC Loss. CTC computes the posterior probability as below,

$$P(C \mid X) = \sum_{\pi \in \mathcal{Z}(C)} P(\pi \mid X) \tag{1}$$

where π is the repeated character sequence of C with additional blank label, and 𝒵(C) is the set of all possible sequences given C. For each π, we can approximate the posterior probability as below,

$$P(\pi \mid X) \approx \prod_{i=1}^{L} P(\hat{c}_i \mid X) \tag{2}$$

Taking $X$ from the $l$-th language as an example, the loss function of the model on $D$ is then defined as:

$$\mathcal{L}_D(\theta, \theta_{h,l}) = -\log P(C \mid X) \tag{3}$$
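To make the sum over alignments concrete, here is a minimal sketch that computes $P(C \mid X)$ for a tiny example in two ways: by enumerating every alignment, and with the standard CTC forward recursion from [8]. It assumes the usual CTC setup in which each alignment has one label per output frame and label 0 is the blank; the function names and the probability table are made up for illustration.

```python
import itertools
import math

BLANK = 0

def collapse(path):
    """CTC collapse: merge repeated labels, then drop blanks."""
    merged = [k for k, _ in itertools.groupby(path)]
    return [k for k in merged if k != BLANK]

def ctc_prob_bruteforce(probs, target):
    """Sum P(pi|X) over every alignment pi that collapses to `target`."""
    T, V = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(V), repeat=T):
        if collapse(path) == list(target):
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return total

def ctc_prob_forward(probs, target):
    """Same quantity via the forward (alpha) recursion, without enumeration."""
    ext = [BLANK]
    for c in target:
        ext += [c, BLANK]                 # blank-augmented target: _ c1 _ c2 _ ...
    S = len(ext)
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]
            if s >= 1:
                a += alpha[s - 1]
            # Skipping a blank is only allowed between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)

# Hypothetical per-frame softmax outputs: 4 frames, labels {blank, 1, 2}.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
    [0.7, 0.1, 0.2],
]
target = [1, 2]
p_brute = ctc_prob_bruteforce(probs, target)
p_fwd = ctc_prob_forward(probs, target)
loss = -math.log(p_fwd)                   # negative log-likelihood of one utterance
print(p_brute, p_fwd, loss)
```

The brute-force version is exponential in the number of frames and only serves as a check; practical implementations use the recursion (in log space) instead.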

2.2 Meta Learning for Low-Resource ASR

The idea of MAML is to learn initialization parameters from a set of tasks. In the context of ASR, we can view different languages as different tasks. Given a set of source tasks $\mathcal{D} = \{D_1, D_2, \dots, D_K\}$, MAML learns from $\mathcal{D}$ to obtain good initialization parameters $\theta$ for the shared encoder. $\theta$ yields fast task-specific learning (fine-tuning) on the target task $D_t$, which produces $\theta_t$ and $\theta_{h,t}$ (the parameters obtained after fine-tuning on $D_t$). MAML can be formulated as below,

$$\theta_t, \theta_{h,t} = \mathtt{Learn}(D_t; \theta) = \mathtt{Learn}(D_t; \mathtt{MetaLearn}(\mathcal{D}))$$

The two functions, Learn and MetaLearn, will be described in the following two subsections.

2.2.1 Learn: Language-specific learning

Given any initial parameters $\theta_0$ of the shared encoder (either randomly initialized or obtained from a pretrained model) and the dataset $D_t$, the language-specific learning process minimizes the CTC loss function defined in Eq. 3.

$$\theta_t, \theta_{h,t} = \mathtt{Learn}(D_t; \theta_0) = \arg\min_{\theta, \theta_{h,t}} \mathcal{L}_{D_t}(\theta, \theta_{h,t}) = \arg\min_{\theta, \theta_{h,t}} \sum_{(X,C) \in D_t} -\log P(C \mid X) \tag{4}$$
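Table 1: Character error rate (% CER) on the FLP of all 4 target languages, by pretraining language set. The (no-pretrain) row has a single value per language, listed in the first column of each pair.

| Pretraining languages | Vietnamese multi | Vietnamese meta | Swahili multi | Swahili meta | Tamil multi | Tamil meta | Kurmanji multi | Kurmanji meta |
|---|---|---|---|---|---|---|---|---|
| (no-pretrain) | 71.8 | | 47.5 | | 69.9 | | 64.3 | |
| Bn Tl Zu | 57.4 | 49.9 | 48.1 | 41.4 | 65.6 | 57.5 | 61.1 | 57.0 |
| Tr Lt Gn | 63.7 | 49.5 | 57.2 | 41.8 | 68.2 | 57.7 | 65.6 | 57.0 |
| Bn Tl Zu Tr Lt Gn | 59.7 | 50.1 | 48.8 | 42.9 | 65.6 | 58.9 | 62.6 | 57.6 |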

The learning process is optimized through gradient descent, the same as MultiASR.

2.2.2 MetaLearn

The initialization parameters found by MAML should adapt well not just to one language, but to as many languages as possible. To achieve this goal, we define the meta learning process and the corresponding meta-objective as follows.

In each meta learning episode, we sample a batch of tasks from $\mathcal{D}$, then sample two subsets from each task $k$ as training and testing sets, denoted as $D_k^{tr}$ and $D_k^{te}$, respectively. First, we use $D_k^{tr}$ to simulate the language-specific learning process to obtain $\theta_k$ and $\theta_{h,k}$.

$$\theta_k, \theta_{h,k} = \mathtt{Learn}(D_k^{tr}; \theta) \tag{5}$$

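We then evaluate the effectiveness of the obtained parameters on $D_k^{te}$. The goal of MAML is to find $\theta$, the initialization weights of the shared encoder for fast adaptation, so the meta-objective is defined as

$$\mathcal{L}_{\mathcal{D}}^{meta}(\theta) = \mathbb{E}_{k \sim \mathcal{D}} \, \mathbb{E}_{D_k^{tr}, D_k^{te}} \left[ \mathcal{L}_{D_k^{te}}(\theta_k, \theta_{h,k}) \right] \tag{6}$$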

Therefore, the meta learning process is to minimize the loss function defined in Eq. 6.

$$\theta = \mathtt{MetaLearn}(\mathcal{D}) = \arg\min_{\theta} \mathcal{L}_{\mathcal{D}}^{meta}(\theta) \tag{7}$$

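We use the meta gradient obtained from Eq. 6 to update the model through gradient descent.

$$\theta \leftarrow \theta - \eta \sum_{k} \nabla_{\theta} \mathcal{L}_{D_k^{te}}(\theta_k, \theta_{h,k}) \tag{8}$$

Here $\eta$ is the meta learning rate. Note that only the shared encoder is updated via Eq. 8.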

MultiASR optimizes the model according to Eq. 4 on all source languages directly, without considering how learning happens on the unseen language. Although the parameters found by MultiASR are good for all source languages, they may not adapt well to the target language. Unlike MultiASR, MetaASR explicitly integrates the learning process into its framework by first simulating language-specific learning and then meta-updating the model. Therefore, the parameters obtained are more suitable for adaptation to the unseen language. We illustrate the concept in Fig. 1, and show it in the experimental results in Section 4.
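The meta-update can be illustrated on a problem small enough to differentiate by hand. The sketch below is a toy under stated assumptions, not the paper's training loop: each "language" is a 1-D regression task $y = w_k x$, the shared "encoder" is a single scalar `theta`, language-specific heads are omitted, and `Learn` is one gradient step with a hypothetical inner learning rate `alpha`. For this model the gradient of the adapted test loss with respect to `theta` follows from the chain rule as $(1 - \alpha H_{tr}) \, \nabla \mathcal{L}_{te}(\theta_k)$, where $H_{tr}$ is the (constant) second derivative of the training loss.

```python
import random

def loss(theta, data):
    return 0.5 * sum((theta * x - y) ** 2 for x, y in data) / len(data)

def grad(theta, data):
    return sum((theta * x - y) * x for x, y in data) / len(data)

def hessian(data):
    """Second derivative of the loss w.r.t. theta (constant for this model)."""
    return sum(x * x for x, _ in data) / len(data)

def learn(theta, train, alpha):
    """Simulated task-specific learning: a single gradient step (cf. Eq. 5)."""
    return theta - alpha * grad(theta, train)

def meta_grad(theta, train, test, alpha):
    """Gradient w.r.t. the initial theta of the test loss after adaptation."""
    theta_k = learn(theta, train, alpha)
    return (1.0 - alpha * hessian(train)) * grad(theta_k, test)

def make_task(slope, n, rng):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, slope * x) for x in xs]

rng = random.Random(0)
slopes = [0.5, 1.5, 2.5]          # three hypothetical source tasks
alpha, eta = 0.1, 0.3             # inner and meta learning rates (toy values)
theta = 0.0
for episode in range(200):
    total = 0.0
    for w in slopes:
        train, test = make_task(w, 8, rng), make_task(w, 8, rng)
        total += meta_grad(theta, train, test, alpha)
    theta -= eta * total          # meta-update (cf. Eq. 8)
print(theta)
```

A multitask update would instead apply `grad(theta, train)` directly, without simulating the adaptation step first; that is the difference the paragraph above describes.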

3 Experiment

In this work, we used data from the IARPA BABEL project [7]. The corpus is mainly composed of conversational telephone speech (CTS). We selected 6 languages as non-target languages for multilingual pretraining: Bengali (Bn), Tagalog (Tl), Zulu (Zu), Turkish (Tr), Lithuanian (Lt), Guarani (Gn), and 4 target languages for adaptation: Vietnamese (Vi), Swahili (Sw), Tamil (Ta), Kurmanji (Ku), and experimented with different combinations of non-target languages for pretraining. Each language has a Full Language Pack (FLP) and a Limited Language Pack (LLP, which consists of 10% of the FLP).

We followed the recipe provided by ESPnet [21] for data preprocessing and final score evaluation. We used 80-dimensional Mel-filterbank and 3-dimensional pitch features as acoustic features. The size of the sliding window is 25ms, and the stride is 10ms. The shared encoder consists of a 6-layer VGG extractor with downsampling followed by a 6-layer bidirectional LSTM network with 360 cells in each direction, as used in previous work [5].
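As a quick sanity check on the feature dimensions, the sketch below counts the frames produced by a 25ms window with a 10ms stride. The 8 kHz sample rate is an assumption (typical for telephone speech, not stated above), as is the choice of no padding at the edges; toolkits differ on both, and `num_frames` is a hypothetical helper.

```python
def num_frames(num_samples, sample_rate, win_ms=25, hop_ms=10):
    """Number of full analysis windows that fit in the signal (no edge padding)."""
    win = sample_rate * win_ms // 1000    # window length in samples
    hop = sample_rate * hop_ms // 1000    # stride in samples
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop

SAMPLE_RATE = 8000                        # assumed sample rate
FEAT_DIM = 80 + 3                         # Mel-filterbank + pitch features per frame

frames = num_frames(3 * SAMPLE_RATE, SAMPLE_RATE)   # a 3-second utterance
print(frames, FEAT_DIM)                   # 298 83
```

Under these assumptions a 3-second utterance becomes a 298 × 83 feature matrix before the encoder's downsampling.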

Meta Learning. For each episode, we used a single gradient step of language-specific learning with SGD when computing the meta gradient. Note that if we expand the loss term in the summation of Eq. 8 via Eq. 4, a second-order derivative with respect to $\theta$ appears. For computational efficiency, some previous works [6, 13] showed that the second-order term can be ignored without affecting the performance too much. Therefore, we approximated Eq. 8 as follows.

$$\theta \leftarrow \theta - \eta \sum_{k} \nabla_{\theta_k} \mathcal{L}_{D_k^{te}}(\theta_k, \theta_{h,k}) \tag{9}$$

This approximation is known as first-order MAML (FOMAML).
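The effect of dropping the second-order term can be seen on a toy scalar model, where both gradients are available in closed form. This is a sketch under assumptions: a 1-D regression loss stands in for the CTC loss, the data points and the inner learning rate `alpha` are made up, and the heads are omitted. With one inner gradient step, the exact meta gradient equals the first-order one scaled by $(1 - \alpha H_{tr})$, where $H_{tr}$ is the second derivative of the training loss.

```python
def grad(theta, data):
    """Gradient of 0.5 * mean((theta * x - y)^2) w.r.t. theta."""
    return sum((theta * x - y) * x for x, y in data) / len(data)

def hessian(data):
    """Second derivative of the same loss (constant in theta)."""
    return sum(x * x for x, _ in data) / len(data)

def exact_meta_grad(theta, train, test, alpha):
    """Differentiates through the inner step, keeping the second-order term."""
    theta_k = theta - alpha * grad(theta, train)
    return (1.0 - alpha * hessian(train)) * grad(theta_k, test)

def first_order_meta_grad(theta, train, test, alpha):
    """FOMAML: treats d(theta_k)/d(theta) as 1, i.e. uses the gradient at theta_k."""
    theta_k = theta - alpha * grad(theta, train)
    return grad(theta_k, test)

# Hypothetical (x, y) pairs for one task.
train = [(1.0, 2.0), (2.0, 4.1), (-1.0, -1.9)]
test = [(0.5, 1.0), (1.5, 3.2)]
alpha, theta = 0.05, 0.0

exact = exact_meta_grad(theta, train, test, alpha)
first_order = first_order_meta_grad(theta, train, test, alpha)
print(exact, first_order, exact / first_order)
```

Here the two gradients point in the same direction and differ only by a scale factor close to 1; in a deep network the Hessian is a matrix, so the first-order version can also change the direction slightly, while avoiding the cost of computing it.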

We pretrained the model multilingually for 100K steps for both MultiASR and MetaASR. When adapting to a given language, we used the LLP of the other three languages as validation sets to decide which pretraining step to pick. We then fine-tuned the model on the target language for 18 epochs on its FLP and 20 epochs on its LLP, and evaluated the performance on the test set via beam search decoding with beam size 20 and 5-gram language model re-scoring. The results are shown in Tables 1 and 2.
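The scores reported in Tables 1 and 2 are character error rates. A minimal sketch of the metric is given below: the total character-level edit distance divided by the total reference length, in percent. It does not reproduce the text normalization of the ESPnet scoring scripts, and the example strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def cer(refs, hyps):
    """Corpus-level character error rate in percent."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * errors / chars

refs = ["habari", "asante"]       # hypothetical reference transcripts
hyps = ["habri", "asanta"]        # hypothetical recognizer outputs
print(f"{cer(refs, hyps):.1f}")   # 16.7
```

Because the denominator is the reference length, insertions can push the CER above 100%.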

4 Results

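Table 2: Character error rate (% CER) on the LLP of all 4 target languages, by pretraining language set. The (no-pretrain) row has a single value per language, listed in the first column of each pair.

| Pretraining languages | Vietnamese multi | Vietnamese meta | Swahili multi | Swahili meta | Tamil multi | Tamil meta | Kurmanji multi | Kurmanji meta |
|---|---|---|---|---|---|---|---|---|
| (no-pretrain) | 74.7 | | 65.0 | | 72.4 | | 68.9 | |
| Bn Tl Zu | 65.0 | 58.1 | 62.6 | 57.5 | 70.4 | 73.7 | 67.6 | 64.6 |
| Tr Lt Gn | 64.9 | 58.0 | 64.1 | 59.6 | 73.7 | 74.7 | 69.7 | 63.0 |
| Bn Tl Zu Tr Lt Gn | 64.1 | 58.7 | 61.9 | 59.6 | 70.0 | 68.2 | 66.7 | 64.1 |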

Performance Comparison of CER on FLP. As presented in Table 1, compared to monolingual training (that is, without using pretrained parameters as initialization, denoted as no-pretrain), both MultiASR and MetaASR improved the ASR performance using different combinations of pretraining languages. Table 1 shows that the proposed MetaASR outperforms MultiASR on every target language and every pretraining set. We were also interested in the impact of the choice of pretraining languages and found that the performance variance of MetaASR is smaller than that of MultiASR. This might be because MetaASR focuses more on the learning process than on fitting the source languages.
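Both observations can be checked directly against the numbers in Table 1. The short script below copies the FLP results (three pretraining sets per language) and reports, for each target language, the smallest gain of meta over multi and the standard deviation across pretraining sets. The dictionary layout and variable names are just one convenient choice.

```python
import statistics

# CER (%) on FLP from Table 1; list order: [Bn Tl Zu, Tr Lt Gn, all six languages].
TABLE1 = {
    "Vietnamese": {"multi": [57.4, 63.7, 59.7], "meta": [49.9, 49.5, 50.1]},
    "Swahili":    {"multi": [48.1, 57.2, 48.8], "meta": [41.4, 41.8, 42.9]},
    "Tamil":      {"multi": [65.6, 68.2, 65.6], "meta": [57.5, 57.7, 58.9]},
    "Kurmanji":   {"multi": [61.1, 65.6, 62.6], "meta": [57.0, 57.0, 57.6]},
}

summary = {}
for lang, rows in TABLE1.items():
    gains = [m - t for m, t in zip(rows["multi"], rows["meta"])]
    summary[lang] = {
        "min_gain": min(gains),                            # smallest absolute CER reduction
        "std_multi": statistics.pstdev(rows["multi"]),     # spread across pretraining sets
        "std_meta": statistics.pstdev(rows["meta"]),
    }
    s = summary[lang]
    print(f"{lang:10s} min gain {s['min_gain']:.1f}  "
          f"std multi {s['std_multi']:.2f}  std meta {s['std_meta']:.2f}")
```

With only three pretraining sets per language these spreads are rough indicators rather than statistical estimates.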


Fig. 3: Learning curves on Swahili’s LLP, pretrained on Bn, Tl, Zu. [Plot: x-axis, number of pretraining steps (×1000); y-axis, CER (%); curves for MultiASR, MetaASR, and no-pretrain.]

Fig. 4: Learning curves on Swahili’s LLP, pretrained on Bn, Tl, Zu, Tr, Lt, Gn. [Plot: x-axis, number of pretraining steps (×1000); y-axis, CER (%); curves for MultiASR, MetaASR, and no-pretrain.]

Learning Curves. The advantage of MetaASR over MultiASR is clearly shown in Fig. 3 and 4. Given the pretrained parameters at a specific pretraining step, we fine-tuned the model for 20 epochs and reported the lowest CER on its validation set. This process yields one point of the curve. For MultiASR, the adaptation performance saturated in the early stage and eventually degraded. As Fig. 1 illustrates, the training scheme of MultiASR tended to overfit to the pretraining languages, and the learned parameters might not be suitable for adaptation. From Fig. 3, we can see that in the later stage of pretraining, using such pretrained weights yields even worse performance than random initialization. In contrast, for MetaASR, not only is the performance better than MultiASR during the whole pretraining process, but it also gradually improves as pretraining continues, without degrading. The adaptation of all languages using different pretraining languages shows similar trends. We only show the results for Swahili here due to space limitations.

Impact of Training Set Size. In addition to adapting on the FLP of the target languages, we also fine-tuned on their LLP, and the results are shown in Table 2. On Vietnamese, Swahili, and Kurmanji, MetaASR also outperforms MultiASR. Both MultiASR and MetaASR improve the performance, but the gap relative to the no-pretrain model is smaller than when fine-tuning on FLP. On Tamil, the weights from the pretrained model were in several settings even worse than random initialization. We will evaluate more combinations of target languages and pretraining languages to investigate the potential of our proposed method in such an ultra low-resource scenario.

5 Conclusion

In this paper, we proposed a meta learning approach to multilingual pretraining for speech recognition. The initial experimental results showed its potential in multilingual pretraining. In future work, we plan to use more combinations of languages and corpora to evaluate the effectiveness of MetaASR extensively. In addition, thanks to MAML’s model-agnostic property, this approach can be applied to a wide range of network architectures, such as sequence-to-sequence models, and even to applications beyond speech recognition.


  • [1] O. Adams, M. Wiesner, S. Watanabe, and D. Yarowsky (2019) Massively multilingual adversarial speech recognition. arXiv preprint arXiv:1904.02210.
  • [2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. (2016) Deep speech 2: end-to-end speech recognition in english and mandarin. In International conference on machine learning, pp. 173–182.
  • [3] J. Cho, M. K. Baskar, R. Li, M. Wiesner, S. H. Mallidi, N. Yalta, M. Karafiat, S. Watanabe, and T. Hori (2018) Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 521–527.
  • [4] R. Collobert, C. Puhrsch, and G. Synnaeve (2016) Wav2letter: an end-to-end convnet-based speech recognition system. arXiv preprint arXiv:1609.03193.
  • [5] S. Dalmia, R. Sanabria, F. Metze, and A. W. Black (2018) Sequence-based multi-lingual low resource speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4909–4913.
  • [6] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135.
  • [7] M. J. Gales, K. M. Knill, A. Ragni, and S. P. Rath (2014) Speech recognition and keyword spotting for low-resource languages: babel project research at cued. In Spoken Language Technologies for Under-Resourced Languages.
  • [8] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pp. 369–376.
  • [9] J. Gu, Y. Wang, Y. Chen, K. Cho, and V. O. Li (2018) Meta-learning for low-resource neural machine translation. arXiv preprint arXiv:1808.08437.
  • [10] A. Hannun, C. Case, J. Casper, B. Catanzaro, et al. (2014) Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
  • [11] O. Klejch, J. Fainberg, and P. Bell (2018) Learning to adapt: a meta-learning approach for speaker adaptation. arXiv preprint arXiv:1808.10239.
  • [12] F. Mi, M. Huang, J. Zhang, and B. Faltings (2019) Meta-learning for low-resource natural language generation in task-oriented dialogue systems. arXiv preprint arXiv:1905.05644.
  • [13] A. Nichol and J. Schulman (2018) Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999.
  • [14] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2018) Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960.
  • [15] R. Sanabria and F. Metze (2018) Hierarchical multi task learning with ctc. arXiv preprint arXiv:1807.07104.
  • [16] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
  • [17] S. Tong, P. N. Garner, and H. Bourlard (2017) An investigation of deep neural networks for multilingual speech recognition training and adaptation. In Proc. of INTERSPEECH.
  • [18] S. Tong, P. N. Garner, and H. Bourlard (2017) Multilingual training and cross-lingual adaptation on ctc-based acoustic model. arXiv preprint arXiv:1711.10025.
  • [19] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638.
  • [20] N. T. Vu, D. Imseng, D. Povey, P. Motlicek, T. Schultz, and H. Bourlard (2014) Multilingual deep neural network based acoustic modeling for rapid language adaptation. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7639–7643.
  • [21] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, et al. (2018) Espnet: end-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015.
  • [22] J. Yi, J. Tao, Z. Wen, and Y. Bai (2018) Adversarial multilingual training for low-resource speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4899–4903.