Abstract
High-fidelity singing voices usually require a higher sampling rate (e.g., 48kHz) to convey expression and emotion. However, a higher sampling rate results in a wider frequency band and longer waveform sequences, which poses challenges for singing voice synthesis (SVS) in both the frequency and time domains. Conventional SVS systems that adopt a low sampling rate cannot adequately address these challenges. In this paper, we develop HiFiSinger, an SVS system towards high-fidelity singing voice. HiFiSinger consists of a FastSpeech-based acoustic model and a Parallel WaveGAN-based vocoder to ensure fast training and inference as well as high voice quality. To tackle the difficulty of singing modeling caused by the high sampling rate (wider frequency band and longer waveform), we introduce multi-scale adversarial training in both the acoustic model and the vocoder. Specifically, 1) to handle the larger range of frequencies caused by the higher sampling rate, we propose a novel sub-frequency GAN (SF-GAN) for mel-spectrogram generation, which splits the full 80-dimensional mel-frequency range into multiple sub-bands and models each sub-band with a separate discriminator. 2) To model the longer waveform sequences caused by the higher sampling rate, we propose a multi-length GAN (ML-GAN) for waveform generation, which models waveform sequences of different lengths with separate discriminators. 3) We also introduce several additional designs and findings in HiFiSinger that are crucial for high-fidelity voices, such as adding F0 (pitch) and V/UV (voiced/unvoiced flag) as acoustic features, choosing an appropriate window/hop size for the mel-spectrogram, and increasing the receptive field of the vocoder for long vowel modeling. Experimental results show that HiFiSinger synthesizes high-fidelity singing voices with much higher quality: MOS gains of 0.32 and 0.44 over the 48kHz and 24kHz baselines, respectively, and a 0.83 MOS gain over previous SVS systems.
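
As a minimal illustration of the two multi-scale ideas above, the following Python sketch shows how an 80-dimensional mel-spectrogram might be split into frequency sub-bands (the inputs to separate SF-GAN discriminators) and how a waveform might be cropped to several lengths (the inputs to separate ML-GAN discriminators). The function names, the number of bands, the equal-width non-overlapping band layout, and the crop lengths are all assumptions made for illustration; the abstract does not specify them, and the discriminators themselves are omitted.

```python
import numpy as np


def split_mel_subbands(mel, num_bands=3):
    """Split a (frames, n_mels) mel-spectrogram into contiguous sub-bands.

    Assumption: equal-width, non-overlapping bands; num_bands=3 is illustrative.
    """
    n_mels = mel.shape[1]
    # Integer band edges covering bins 0..n_mels.
    edges = np.linspace(0, n_mels, num_bands + 1).astype(int)
    return [mel[:, lo:hi] for lo, hi in zip(edges[:-1], edges[1:])]


def crop_multi_length(wave, lengths, rng):
    """Take one random crop of each requested length from a 1-D waveform.

    Assumption: one uniformly positioned crop per length.
    """
    crops = []
    for n in lengths:
        if n > len(wave):
            raise ValueError("crop length exceeds waveform length")
        start = rng.integers(0, len(wave) - n + 1)
        crops.append(wave[start:start + n])
    return crops


rng = np.random.default_rng(0)
mel = rng.standard_normal((200, 80))   # stand-in for a generated mel-spectrogram
wave = rng.standard_normal(48000)      # stand-in for one second of audio at 48kHz

bands = split_mel_subbands(mel, num_bands=3)
crops = crop_multi_length(wave, lengths=(4800, 12000, 24000), rng=rng)

# Each sub-band and each crop would be passed to its own discriminator.
print([b.shape for b in bands])
print([len(c) for c in crops])
```

In an adversarial training loop, each element of `bands` and `crops` would feed a dedicated discriminator, so that no single network has to judge the full frequency range or the full waveform length at once.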