Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages

Abstract

In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by training additionally on 5 minutes of target speaker data. In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3, with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.
