Abstract
Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FD_openl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1 kHz.