Abstract
Automatic speech recognition has recently seen significant advances with large foundational models such as Whisper. However, these models often struggle to perform well in low-resource languages, such as Indian languages. This paper explores two novel approaches to enhance Whisper's multilingual speech recognition performance in Indian languages. First, we propose prompt-tuning with language family information, which enhances Whisper's accuracy in linguistically similar languages. Second, we introduce a new tokenizer that reduces the number of generated tokens, thereby accelerating Whisper's inference speed. Our extensive experiments demonstrate that the tokenizer significantly reduces inference time, while prompt-tuning enhances accuracy across various Whisper model sizes, including Small, Medium, and Large. Together, these techniques strike a balance between word error rate (WER) and inference speed.