Multilingual End-to-End Speech Recognition with A Single Transformer on Low-Resource Languages

Abstract

Sequence-to-sequence attention-based models integrate an acoustic, pronunciation, and language model into a single neural network, which makes them very suitable for multilingual automatic speech recognition (ASR). In this paper, we address multilingual speech recognition on low-resource languages with a single Transformer, one of the sequence-to-sequence attention-based models. Sub-words are employed as the multilingual modeling unit without using any pronunciation lexicon. First, we show that a single multilingual ASR Transformer performs well on low-resource languages despite some language confusion. We then look at incorporating language information into the model by inserting a language symbol at the beginning or at the end of the original sub-word sequence, under the condition that the language information is known during training. Experiments on CALLHOME datasets demonstrate that the multilingual ASR Transformer with the language symbol at the end performs better and obtains a 10.5% relative reduction in average word error rate (WER) compared to SHL-MLSTM with residual learning. We go on to show that, assuming the language information is known during both training and testing, a relative average WER reduction of about 12.4% over SHL-MLSTM with residual learning can be observed by giving the language symbol as the sentence start token.
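
The three ways of attaching a language symbol to the target sequence described in the abstract can be illustrated with a minimal sketch. The token spellings (`<sos>`, `<eos>`, `<EN>`), the function name `add_language_symbol`, and the `position` values are assumptions made for illustration only; they are not taken from the paper.

```python
from typing import List

# Hypothetical special tokens; the paper's actual symbols may differ.
SOS = "<sos>"
EOS = "<eos>"


def add_language_symbol(subwords: List[str], lang: str, position: str) -> List[str]:
    """Build a decoder target sequence carrying a language symbol.

    position:
      "begin" - language symbol inserted before the sub-word sequence
      "end"   - language symbol inserted after the sub-word sequence
      "start" - language symbol used in place of the sentence start token
                (requires the language to be known at test time as well)
    """
    lang_symbol = f"<{lang}>"  # hypothetical spelling of the language symbol
    if position == "begin":
        return [SOS, lang_symbol, *subwords, EOS]
    if position == "end":
        return [SOS, *subwords, lang_symbol, EOS]
    if position == "start":
        return [lang_symbol, *subwords, EOS]
    raise ValueError(f"unknown position: {position!r}")


if __name__ == "__main__":
    pieces = ["hel", "lo", "wor", "ld"]  # made-up sub-word units
    for pos in ("begin", "end", "start"):
        print(pos, add_language_symbol(pieces, "EN", pos))
```

In the first two variants the model predicts the language symbol as part of its output, so the language only needs to be known during training; in the third, the symbol is supplied to the decoder as its first input, which is why the language must also be known at test time.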
