Abstract
Automatic Speech Recognition (ASR) models demonstrate outstanding performance on high-resource languages but face significant challenges when applied to low-resource languages due to limited training data and insufficient cross-lingual generalization. Existing adaptation strategies, such as shallow fusion, data augmentation, and direct fine-tuning, either rely on external resources, suffer from computational inefficiencies, or fail in test-time adaptation scenarios. To address these limitations, we introduce Speech Meta In-Context LEarning (SMILE), an innovative framework that combines meta-learning with speech in-context learning (SICL). SMILE leverages meta-training on high-resource languages to enable robust few-shot generalization to low-resource languages without explicit fine-tuning on the target domain. Extensive experiments on the ML-SUPERB benchmark show that SMILE consistently outperforms baseline methods, significantly reducing character and word error rates in training-free few-shot multilingual ASR tasks.