Abstract
The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of a universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing too strong a model can hurt zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system's performance, and retaining only the target language's phonotactic data in LM training is preferable.