CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Abstract

In recent years, large language model (LLM) based text-to-speech (TTS) has entered the mainstream owing to its high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed into waveforms by a token-based vocoder. Speech tokens therefore play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner and thus lack explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on these tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves the synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to incorporate supervised speech tokens into TTS models.
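As a rough illustration of what vector quantization in an encoder does, the sketch below maps each continuous frame vector to the index of its nearest codebook entry, producing a discrete token sequence. It is a generic nearest-neighbor quantizer written under stated assumptions, not the implementation used in CosyVoice: the function name `quantize`, the two-dimensional frames, and the three-entry codebook are all hypothetical values chosen for readability.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each frame, the index of the closest codebook vector.

    Closeness is measured by squared Euclidean distance. Ties resolve to
    the lowest index because list.index returns the first match.
    """
    tokens = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical toy values: a 3-entry codebook and three encoder frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2]
```

In a trained system the codebook would be learned jointly with the encoder and the frames would be high-dimensional hidden states; here the point is only that the output is a sequence of integer indices that a language model can treat as tokens.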
