Streaming End-to-End Bilingual ASR Systems with Joint Language Identification

Abstract

Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream processing of ASR output. In this paper, we introduce streaming, end-to-end, bilingual systems that perform both ASR and language identification (LID) using the recurrent neural network transducer (RNN-T) architecture. On the input side, embeddings from pretrained acoustic-only LID classifiers are used to guide RNN-T training and inference, while on the output side, language targets are jointly modeled with ASR targets. The proposed method is applied to two language pairs: English-Spanish as spoken in the United States, and English-Hindi as spoken in India. Experiments show that for English-Spanish, the bilingual joint ASR-LID architecture matches monolingual ASR and acoustic-only LID accuracies. For the more challenging (owing to within-utterance code switching) case of English-Hindi, English ASR and LID metrics show degradation. Overall, in scenarios where users switch dynamically between languages, the proposed architecture offers a promising simplification over running multiple monolingual ASR models and an LID classifier in parallel.
