Abstract
We propose MusicRL, the first music generation system finetuned from human feedback. Appreciation of text-to-music models is particularly subjective since the concept of musicality as well as the specific intention behind a caption are user-dependent (e.g. a caption such as "upbeat work-out music" can map to a retro guitar solo or a techno pop beat). Not only does this make supervised training of such models challenging, but it also calls for integrating continuous human feedback in their post-deployment finetuning. MusicRL is a pretrained autoregressive MusicLM (Agostinelli et al., 2023) model of discrete audio tokens finetuned with reinforcement learning to maximise sequence-level rewards. We design reward functions related specifically to text adherence and audio quality with help from selected raters, and use those to finetune MusicLM into MusicRL-R. We deploy MusicLM to users and collect a substantial dataset comprising 300,000 pairwise preferences. Using Reinforcement Learning from Human Feedback (RLHF), we train MusicRL-U, the first text-to-music model that incorporates human feedback at scale. Human evaluations show that both MusicRL-R and MusicRL-U are preferred to the baseline. Ultimately, MusicRL-RU combines the two approaches and results in the best model according to human raters. Ablation studies shed light on the musical attributes influencing human preferences, indicating that text adherence and quality account for only part of them. This underscores the prevalence of subjectivity in musical appreciation and calls for further involvement of human listeners in the finetuning of music generation models.
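
As an illustration of how pairwise preferences like those described above are commonly turned into a training signal for a reward model, the sketch below computes a Bradley-Terry negative log-likelihood for a pair of scalar rewards. The abstract does not state which preference model is used, so the Bradley-Terry form, the function names, and the example reward values are all assumptions made for illustration, not the authors' implementation.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of one preference under a Bradley-Terry model.

    Assumption: P(preferred beats rejected) = sigmoid(r_preferred - r_rejected).
    This is a common RLHF choice, not one confirmed by the abstract.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(reward_pairs):
    """Average loss over (reward_preferred, reward_rejected) pairs."""
    return sum(pairwise_preference_loss(w, l) for w, l in reward_pairs) / len(reward_pairs)


if __name__ == "__main__":
    # Hypothetical reward-model scores for three (preferred, rejected) clip pairs.
    example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
    print(mean_preference_loss(example_pairs))
```

The loss shrinks as the reward model scores the preferred clip further above the rejected one, and equals log 2 when the two scores are tied. A reward model trained this way could then supply the sequence-level reward that the reinforcement learning stage maximises.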