Abstract
Current expressive speech synthesis models are constrained by the limited availability of open-source datasets containing diverse nonverbal vocalizations (NVs). In this work, we introduce NonverbalTTS (NVTTS), a 17-hour open-access dataset annotated with 10 types of NVs (e.g., laughter, coughs) and 8 emotional categories. The dataset is derived from popular sources, VoxCeleb and Expresso, using automated detection followed by human validation. We propose a comprehensive pipeline that integrates automatic speech recognition (ASR), NV tagging, emotion classification, and a fusion algorithm to merge transcriptions from multiple annotators. Fine-tuning open-source text-to-speech (TTS) models on the NVTTS dataset achieves parity with closed-source systems such as CosyVoice2, as measured by both human evaluation and automatic metrics, including speaker similarity and NV fidelity. By releasing NVTTS and its accompanying annotation guidelines, we address a key bottleneck in expressive TTS research. The dataset is available at https://huggingface.co/datasets/deepvk/NonverbalTTS.