Towards Building Large Scale Datasets and State-of-the-Art Automatic Speech Translation Systems for 14 Indian Languages

Abstract

Speech translation for Indian languages remains a challenging task due to the scarcity of large-scale, publicly available datasets that capture the linguistic diversity and domain coverage essential for real-world applications. Existing datasets cover only a fraction of Indian languages and lack the breadth needed to train robust models that generalize beyond curated benchmarks. To bridge this gap, we introduce BhasaAnuvaad, the largest speech translation dataset for Indian languages, spanning over 44 thousand hours of audio and 17 million aligned text segments across 14 Indian languages and English. Our dataset is built through a threefold methodology: (a) aggregating high-quality existing sources, (b) large-scale web crawling to ensure linguistic and domain diversity, and (c) creating synthetic data to model real-world speech disfluencies. Leveraging BhasaAnuvaad, we train IndicSeamless, a state-of-the-art speech translation model for Indian languages that performs better than existing models. Our experiments demonstrate improvements in translation quality, setting a new standard for Indian language speech translation. We will release all the code, data, and model weights as open source under permissive licenses to promote accessibility and collaboration.
