Abstract
Speech Self-Supervised Learning (SSL) models achieve impressive performance on Automatic Speech Recognition (ASR). However, in low-resource language ASR, they suffer from a domain mismatch between the pre-training languages and the low-resource target languages. Typical solutions such as fine-tuning the SSL model incur high computation costs, while using frozen SSL models as feature extractors yields poor performance. To handle these issues, we extend a conventional efficient fine-tuning scheme based on the adapter. We add an extra intermediate adaptation step that warms up the initialization of the adapter and the downstream model. Remarkably, we update only 1-5% of the total model parameters to achieve the adaptation. Experimental results on the ML-SUPERB dataset show that our solution outperforms conventional efficient fine-tuning. It achieves up to a 28% relative improvement in the Character/Phoneme error rate when adapting to unseen languages.