Abstract
Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word unit, lexicon and word inventories are typically language specific. In contrast, sequence-to-sequence models are well suited for multilingual ASR because they encapsulate an acoustic, pronunciation and language model jointly in a single network. In this work we present a single sequence-to-sequence ASR model trained on 9 different Indian languages, which have very little overlap in their scripts. Specifically, we take a union of language-specific grapheme sets and train a grapheme-based sequence-to-sequence model jointly on data from all languages. We find that this model, which is not explicitly given any information about language identity, improves recognition performance by 21% relative compared to analogous sequence-to-sequence models trained on each language individually. By modifying the model to accept a language identifier as an additional input feature, we improve performance by a further 7% relative and eliminate confusion between different languages.
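The two ingredients named above, a shared output vocabulary formed as the union of per-language grapheme sets and a language identifier supplied as an extra input feature, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the toy transcripts and the feature values are hypothetical. Treating each Unicode code point as a grapheme and appending a one-hot language vector to every acoustic frame are assumptions made for illustration; the abstract does not specify either detail.

```python
from typing import Dict, List, Sequence


def build_grapheme_vocab(texts_by_language: Dict[str, List[str]]) -> List[str]:
    """Return the union of per-language grapheme sets in a stable sorted order.

    Assumption: each Unicode code point counts as one grapheme.
    """
    vocab = set()
    for transcripts in texts_by_language.values():
        for text in transcripts:
            vocab.update(text)
    return sorted(vocab)


def append_language_id(
    frames: Sequence[Sequence[float]],
    language: str,
    languages: Sequence[str],
) -> List[List[float]]:
    """Concatenate a one-hot language vector to every acoustic frame.

    Assumption: the language identifier is fed as a one-hot feature per frame.
    """
    one_hot = [1.0 if lang == language else 0.0 for lang in languages]
    return [list(frame) + one_hot for frame in frames]


# Toy transcripts in two scripts with no shared code points (hypothetical data).
texts = {"hi": ["नमस्ते"], "ta": ["வணக்கம்"]}
vocab = build_grapheme_vocab(texts)
languages = sorted(texts)

# Two toy acoustic frames with two features each (hypothetical values).
frames = [[0.1, 0.2], [0.3, 0.4]]
conditioned = append_language_id(frames, "ta", languages)
```

Because the scripts barely overlap, the joint vocabulary is roughly the sum of the individual grapheme sets, and the language vector adds only as many input dimensions as there are languages.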