Msanii: High Fidelity Music Synthesis on a Shoestring Budget

Abstract

In this paper, we present Msanii, a novel diffusion-based model for synthesizing long-context, high-fidelity music efficiently. Our model combines the expressiveness of mel spectrograms, the generative capabilities of diffusion models, and the vocoding capabilities of neural vocoders. We demonstrate the effectiveness of Msanii by synthesizing tens of seconds (190 seconds) of stereo music at high sample rates (44.1 kHz) without the use of concatenative synthesis, cascading architectures, or compression techniques. To the best of our knowledge, this is the first work to successfully employ a diffusion-based model for synthesizing such long music samples at high sample rates. Our demo can be found at https://kinyugo.github.io/msanii-demo and our code at https://github.com/Kinyugo/msanii .
