Abstract
In this work, we propose DiffWave, a versatile Diffusion probabilistic model for conditional and unconditional Waveform generation. The model is non-autoregressive, and converts a white noise signal into a structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of the variational bound on the data likelihood. DiffWave produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity, according to various automatic and human evaluations.
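
To make the synthesis procedure concrete, the following is a minimal sketch of a reverse Markov chain that maps white noise to a waveform in a fixed number of steps. It assumes the standard noise-prediction parameterization of denoising diffusion models, which the abstract itself does not spell out. The names `reverse_diffusion` and `dummy_eps_model`, the linear noise schedule, and the step count of 50 are illustrative assumptions rather than the configuration used in this work; in practice a trained network would take the place of `dummy_eps_model`.

```python
import numpy as np


def reverse_diffusion(eps_model, length, betas, rng):
    """Run a reverse diffusion chain from white noise to a 1-D signal.

    eps_model(x, t) is assumed to return a noise estimate shaped like x.
    betas is the forward-process noise schedule (one value per step).
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    # Start from white noise, x_T ~ N(0, I).
    x = rng.standard_normal(length)

    # One model call per step: the chain length is fixed by len(betas),
    # not by the number of audio samples being generated.
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t)
        # Mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Standard deviation of the reverse transition for steps t > 0.
            sigma = np.sqrt((1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t])
            x = mean + sigma * rng.standard_normal(length)
        else:
            # The final step is deterministic.
            x = mean
    return x


def dummy_eps_model(x, t):
    # Hypothetical stand-in for a trained noise-prediction network.
    return np.zeros_like(x)


# Assumed linear schedule with 50 steps (illustrative values only).
betas = np.linspace(1e-4, 0.05, 50)
waveform = reverse_diffusion(dummy_eps_model, 16000, betas, np.random.default_rng(0))
print(waveform.shape)  # (16000,)
```

Because every step updates all samples at once, the number of model evaluations depends only on the length of the schedule, which is what distinguishes this kind of sampler from sample-by-sample autoregressive generation.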