Abstract
Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made substantial progress, they still face challenges in controlling the emotional expression of the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained, freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques. In addition, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using the state-of-the-art multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Dataset, code, checkpoints, and demo samples are available at https://github.com/yanghaha0908/EmoVoice.