Abstract
Text-to-speech (TTS) systems typically require high-quality studio data and accurate transcriptions for training. India has 1369 languages, 22 of them official and written in 13 scripts. Training a TTS system for all these languages, most of which have no digital resources, seems a Herculean task. Our work focuses on zero-shot synthesis, particularly for languages whose scripts and phonotactics come from different families. The novelty of our work lies in augmenting a shared phone representation and modifying the text parsing rules to match the phonotactics of the target language, thus reducing the synthesiser overhead and enabling rapid adaptation. Intelligible and natural speech was generated for Sanskrit, Maharashtrian and Canara Konkani, Maithili and Kurukh by leveraging linguistic connections across languages with suitable synthesisers. Evaluations confirm the effectiveness of this approach, highlighting its potential to expand speech technology access for under-represented languages.