Multilingual End-to-End Speech Recognition with A Single Transformer on Low-Resource Languages

Abstract

Sequence-to-sequence attention-based models integrate acoustic, pronunciation, and language models into a single neural network, which makes them well suited to multilingual automatic speech recognition (ASR). In this paper, we study multilingual end-to-end speech recognition on low-resource languages with a single Transformer, one such sequence-to-sequence attention-based model. Sub-words are employed as the multilingual modeling unit, without using any pronunciation lexicon. First, we show that a single multilingual ASR Transformer performs well on low-resource languages despite some language confusion. We then incorporate language-specific information into the model by inserting a language symbol at the beginning or at the end of the original sub-word sequence, under the condition that the language is known during training. Experiments on the CALLHOME datasets demonstrate that the multilingual ASR Transformer with the language symbol at the end performs better, obtaining a 10.5% relative reduction in average word error rate (WER) compared to SHL-MLSTM with residual learning. We go on to show that, when the language is known during both training and testing, using the language symbol as the sentence start token yields a relative reduction in average WER of about 12.4% compared to SHL-MLSTM with residual learning.
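The three ways of attaching the language symbol to a target sequence can be illustrated with a minimal sketch. The token spellings (`<sos>`, `<eos>`, `<EN>`), the function name `add_language_symbol`, and the exact placement relative to the start and end tokens are assumptions made for illustration; they are not taken from the paper's implementation.

```python
from typing import List

# Hypothetical sentence boundary tokens; the actual symbols may differ.
SOS = "<sos>"
EOS = "<eos>"


def add_language_symbol(subwords: List[str], lang: str, position: str) -> List[str]:
    """Build a decoder target sequence carrying a language symbol.

    position:
      "begin"       -> language symbol inserted before the sub-words
      "end"         -> language symbol inserted after the sub-words
      "start_token" -> language symbol replaces the sentence start token
                       (requires the language to be known at test time too)
    """
    lang_token = f"<{lang}>"  # assumed spelling of the language symbol
    if position == "begin":
        return [SOS, lang_token, *subwords, EOS]
    if position == "end":
        return [SOS, *subwords, lang_token, EOS]
    if position == "start_token":
        return [lang_token, *subwords, EOS]
    raise ValueError(f"unknown position: {position!r}")


if __name__ == "__main__":
    target = ["_he", "llo", "_wor", "ld"]
    for pos in ("begin", "end", "start_token"):
        print(pos, add_language_symbol(target, "EN", pos))
```

In the first two variants the model has to predict the language symbol itself, so only training-time language labels are needed. In the third variant the symbol is supplied as the first decoder input, which is why the language must also be known during testing.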
