kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech

Abstract

While recent zero-shot multi-speaker text-to-speech (TTS) models achieve impressive results, they typically rely on extensive transcribed speech datasets from numerous speakers and intricate training pipelines. Meanwhile, self-supervised learning (SSL) speech features have emerged as effective intermediate representations for TTS. Further, SSL features from different speakers that are linearly close share phonetic information while maintaining individual speaker identity. In this study, we introduce kNN-TTS, a simple and effective framework for zero-shot multi-speaker TTS using retrieval methods which leverage the linear relationships between SSL features. Objective and subjective evaluations show that our models, trained on transcribed speech from a single speaker only, achieve performance comparable to state-of-the-art models that are trained on significantly larger training datasets. The low training data requirements mean that kNN-TTS is well suited for the development of multi-speaker TTS systems for low-resource domains and languages. We also introduce an interpolation parameter which enables fine-grained voice morphing. Demo samples are available at https://idiap.github.io/knn-tts
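
The retrieval and interpolation ideas mentioned above can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the value of k, or the exact interpolation rule, so everything here is an assumption made for illustration: cosine similarity, averaging of the k nearest target frames, and a linear blend controlled by a hypothetical parameter `lam`. The function names are likewise hypothetical and are not taken from the kNN-TTS code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_retrieve(source_frames, target_frames, k=4, lam=1.0):
    """Replace each source frame with a blend of itself and the mean of its
    k most similar target-speaker frames.

    lam = 1.0 uses only the retrieved target features; lam = 0.0 leaves the
    source features unchanged (assumed linear interpolation rule).
    """
    if not target_frames:
        raise ValueError("target_frames must contain at least one frame")
    converted = []
    for frame in source_frames:
        # Rank all target frames by similarity to the current source frame.
        ranked = sorted(target_frames, key=lambda t: cosine(frame, t), reverse=True)
        neighbours = ranked[:k]
        # Average the selected neighbours dimension by dimension.
        mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
        # Blend the retrieved features with the original source features.
        converted.append([lam * m + (1.0 - lam) * s for m, s in zip(mean, frame)])
    return converted
```

In this sketch, `source_frames` would stand for SSL features predicted from text for the single training speaker, and `target_frames` for SSL features extracted from a short recording of the target speaker. Intermediate values of `lam` give a blend of the two voices.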
