Abstract
Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps. While long-range dependencies are difficult to model directly in the time domain, we show that they can be more tractably modelled in two-dimensional time-frequency representations such as spectrograms. By leveraging this representational advantage, in conjunction with a highly expressive probabilistic model and a multiscale generation procedure, we design a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve. We apply our model to a variety of audio generation tasks, including unconditional speech generation, music generation, and text-to-speech synthesis, showing improvements over previous approaches in both density estimates and human judgments.