Abstract
With rapid advances in generative artificial intelligence, the text-to-music synthesis task has emerged as a promising direction for music generation. Nevertheless, achieving precise control over multi-track generation remains an open challenge. While existing models excel at directly generating multi-track mixes, their limitations become evident when it comes to composing individual tracks and integrating them in a controllable manner. This departure from the typical workflows of professional composers hinders the ability to refine details in specific tracks. To address this gap, we propose JEN-1 Composer, a unified framework designed to efficiently model marginal, conditional, and joint distributions over multi-track music using a single model. Building upon an audio latent diffusion model, JEN-1 Composer extends the versatility of multi-track music generation. We introduce a progressive curriculum training strategy, which gradually escalates the difficulty of training tasks while ensuring the model's generalization ability and facilitating smooth transitions between different scenarios. During inference, users can iteratively generate and select music tracks, thus incrementally composing entire musical pieces in accordance with a human-AI co-composition workflow. Our approach demonstrates state-of-the-art performance in controllable and high-fidelity multi-track music synthesis, marking a significant advancement in interactive AI-assisted music creation. Our demo pages are available at www.jenmusic.ai/research.