Abstract
Code-switching (CS) automatic speech recognition (ASR) faces challenges due to the language confusion that arises from accents, auditory similarity, and seamless language switches. Adapting pre-trained multilingual models has shown promising performance for CS-ASR. In this paper, we adapt Whisper, a large-scale multilingual pre-trained speech recognition model, to CS on both the encoder and decoder sides. First, we propose an encoder refiner to enhance the encoder's capacity to handle intra-sentence switching. Second, we propose using two sets of language-aware adapters with different language prompt embeddings to obtain language-specific decoding information in each decoder layer. A fusion module is then added to fuse the language-aware decoding. Experimental results on the SEAME dataset show that, compared with the baseline model, the proposed approach achieves relative mixed error rate (MER) reductions of 4.1% and 7.2% on the dev_man and dev_sge test sets, respectively, surpassing state-of-the-art methods. Further experiments show that the proposed method significantly improves performance on the non-native language in CS speech, indicating that our approach enables Whisper to better distinguish between the two languages.
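
For readers unfamiliar with the metric, the following is a minimal sketch of how an MER and a relative reduction of the kind reported above could be computed. It assumes the common convention for Mandarin-English data of scoring Mandarin at the character level and English at the word level; the abstract does not specify the scoring script, so the tokenization rule, the function names, and the numeric values below are all illustrative assumptions rather than the authors' evaluation code.

```python
import re


def mixed_tokens(text: str) -> list[str]:
    # Assumed tokenization: each CJK character is one token, and each run of
    # Latin letters, digits, or apostrophes is one token.
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9']+", text)


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    # Standard Levenshtein distance over token sequences (row-by-row DP).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def mer(reference: str, hypothesis: str) -> float:
    # Mixed error rate in percent: token-level edits divided by reference length.
    ref_tokens = mixed_tokens(reference)
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref_tokens, mixed_tokens(hypothesis)) / len(ref_tokens)


def relative_reduction(baseline_mer: float, proposed_mer: float) -> float:
    # Relative reduction in percent: 100 * (baseline - proposed) / baseline.
    if baseline_mer <= 0:
        raise ValueError("baseline_mer must be positive")
    return 100.0 * (baseline_mer - proposed_mer) / baseline_mer


# Hypothetical utterance pair and error rates, chosen only to show the arithmetic.
example_mer = mer("我 想 去 shopping", "我 想 shopping")
example_gain = relative_reduction(baseline_mer=20.0, proposed_mer=19.0)
```

Under this reading, a relative reduction of 4.1% means the proposed model's MER is 4.1% lower than the baseline's MER as a fraction of the baseline, not 4.1 absolute percentage points lower.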