Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR

Abstract

This paper introduces the integration of language-specific bi-directional context into a speech large language model (SLLM) to improve multilingual continuous conversational automatic speech recognition (ASR). We propose a character-level contextual masking strategy during training, which randomly removes portions of the context to enhance robustness and better emulate the flawed transcriptions that may occur during inference. For decoding, a two-stage pipeline is utilized: initial isolated segment decoding followed by context-aware re-decoding using neighboring hypotheses. Evaluated on the 1500-hour Multilingual Conversational Speech and Language Model (MLC-SLM) corpus covering eleven languages, our method achieves an 18% relative improvement compared to a strong baseline, outperforming even the model trained on 6000 hours of data for the MLC-SLM competition. These results underscore the significant benefit of incorporating contextual information in multilingual continuous conversational ASR.
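The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how characters are chosen for removal or how context is passed to the model, so the sketch assumes each context character is dropped independently with probability `mask_ratio`, and it stands in for the SLLM with a hypothetical `decode(segment, prev_context, next_context)` callable. The names `mask_context`, `two_stage_decode`, and `mask_ratio` are all illustrative.

```python
import random
from typing import Callable, List, Optional


def mask_context(context: str, mask_ratio: float, rng: random.Random) -> str:
    """Character-level masking of a context string.

    Assumption: each character is removed independently with probability
    mask_ratio. The abstract only states that portions of the context are
    randomly removed, not the exact sampling scheme.
    """
    return "".join(ch for ch in context if rng.random() >= mask_ratio)


def two_stage_decode(
    segments: List[str],
    decode: Callable[[str, Optional[str], Optional[str]], str],
) -> List[str]:
    """Two-stage decoding over the consecutive segments of one conversation.

    decode is a hypothetical stand-in for the speech LLM; it receives a
    segment plus optional previous and following context.
    """
    # Stage 1: decode every segment in isolation, with no context.
    first_pass = [decode(seg, None, None) for seg in segments]

    # Stage 2: re-decode each segment, conditioning on the first-pass
    # hypotheses of its neighbours on both sides (bi-directional context).
    second_pass = []
    for i, seg in enumerate(segments):
        prev_hyp = first_pass[i - 1] if i > 0 else None
        next_hyp = first_pass[i + 1] if i + 1 < len(segments) else None
        second_pass.append(decode(seg, prev_hyp, next_hyp))
    return second_pass
```

In this sketch, `mask_context` would be applied to the neighboring transcripts during training, so that the model also sees degraded context of the kind that imperfect first-pass hypotheses supply at inference time.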
