A Configurable Multilingual Model is All You Need to Recognize All Languages

Abstract

Multilingual automatic speech recognition (ASR) models have shown great promise in recent years because they simplify model training and deployment. Conventional methods train a universal multilingual model either without any language information or with a one-hot language ID (LID) vector that guides recognition of the target language. In practice, the user can be prompted to pre-select several languages they can speak. The multilingual model without LID cannot make good use of the language information set by the user, while the multilingual model with LID can handle only one pre-selected language. In this paper, we propose a novel configurable multilingual model (CMM) which is trained only once but can be configured as different models based on users' choices by extracting language-specific modules together with a universal model from the trained CMM. In particular, a single CMM can be deployed to any user scenario where the users can pre-select any combination of languages. Trained with 75K hours of transcribed anonymized Microsoft multilingual data and evaluated with 10-language test sets, the proposed CMM improves on the universal multilingual model by 26.0%, 16.9%, and 10.4% relative word error reduction when the user selects 1, 2, or 3 languages, respectively. CMM also performs significantly better on code-switching test sets.
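The configuration step described above can be illustrated with a minimal sketch. It only shows the idea of extracting a universal model plus the language-specific modules for a user's pre-selected languages from one trained model. The names `TrainedCMM` and `configure`, the language codes, and the string placeholders for weights are all hypothetical, and the abstract does not specify how the extracted modules are combined at inference time, so that part is left out.

```python
from dataclasses import dataclass, field


@dataclass
class TrainedCMM:
    """Toy stand-in for a trained CMM (hypothetical structure):
    one shared universal part plus one module per language."""
    universal: str
    language_modules: dict[str, str] = field(default_factory=dict)


def configure(cmm: TrainedCMM, selected: list[str]) -> dict:
    """Extract the universal model and the modules for the selected languages.

    This is an assumption-laden illustration of "train once, configure many
    times"; it does not reproduce the paper's actual architecture.
    """
    missing = [lang for lang in selected if lang not in cmm.language_modules]
    if missing:
        raise KeyError(f"no language-specific module for: {missing}")
    return {
        "universal": cmm.universal,
        "language_modules": {lang: cmm.language_modules[lang] for lang in selected},
    }


# Placeholder weights; a real model would hold tensors here.
cmm = TrainedCMM(
    universal="universal-weights",
    language_modules={"en": "en-weights", "de": "de-weights", "fr": "fr-weights"},
)

# The same trained model yields different deployable configurations.
monolingual = configure(cmm, ["fr"])
bilingual = configure(cmm, ["en", "de"])
print(sorted(bilingual["language_modules"]))
```

The point of the sketch is that `cmm` is built once, while each call to `configure` yields a different deployable subset without retraining.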
