Abstract
Deep generative models applied to audio have improved the state of the art by a large margin in many speech- and music-related tasks. However, as raw waveform modelling remains an inherently difficult task, audio generative models are either computationally intensive, rely on low sampling rates, are complicated to control, or restrict the nature of possible signals. Among those models, Variational AutoEncoders (VAEs) give control over the generation by exposing latent variables, although they usually suffer from low synthesis quality. In this paper, we introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We introduce a novel two-stage training procedure, namely representation learning and adversarial fine-tuning. We show that a post-training analysis of the latent space allows direct control over the trade-off between reconstruction fidelity and representation compactness. By leveraging a multi-band decomposition of the raw waveform, we show that our model is the first able to generate 48 kHz audio signals while simultaneously running 20 times faster than real-time on a standard laptop CPU. We evaluate synthesis quality using both quantitative and qualitative subjective experiments and show the superiority of our approach compared to existing models. Finally, we present applications of our model for timbre transfer and signal compression. All of our source code and audio examples are publicly available.