Abstract
Automatic Speech Translation (AST) datasets for Indian languages remain critically scarce, with public resources covering fewer than 10 of the 22 official languages. This scarcity has resulted in AST systems for Indian languages lagging far behind those available for high-resource languages like English. In this paper, we first evaluate the performance of widely used AST systems on Indian languages, identifying notable performance gaps and challenges. Our findings show that while these systems perform adequately on read speech, they struggle significantly with spontaneous speech, including disfluencies like pauses and hesitations. Additionally, there is a striking absence of systems capable of accurately translating colloquial and informal language, a key aspect of everyday communication. To this end, we introduce BhasaAnuvaad, the largest publicly available dataset for AST involving 13 out of 22 scheduled Indian languages and English spanning over 44,400 hours and 17M text segments. BhasaAnuvaad contains data for English speech to Indic text, as well as Indic speech to English text. This dataset comprises three key categories: (1) Curated datasets from existing resources, (2) Large-scale web mining, and (3) Synthetic data generation. By offering this diverse and expansive dataset, we aim to bridge the resource gap and promote advancements in AST for Indian languages.