RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing

Abstract

Large language models are widely used in creative writing applications. Creative writing requires a balance between subjective writing quality (e.g., literariness and emotional expression) and objective constraint following (e.g., format requirements and word limits). Existing methods struggle to balance these two aspects: single-reward strategies fail to improve both abilities simultaneously, while fixed-weight mixed-reward methods cannot adapt to different writing scenarios. To address this problem, we propose Reinforcement Learning with Mixed Rewards (RLMR), which uses a dynamically mixed reward system combining a writing reward model that evaluates subjective writing quality and a constraint verification model that assesses objective constraint following. The key innovation of the method is that the constraint-following reward weight is adjusted dynamically according to the writing quality within each sampled group, ensuring that samples violating constraints receive a negative advantage in GRPO and are thus penalized during training. We conduct automated and manual evaluations across diverse model families from 8B to 72B parameters. Additionally, we construct a real-world writing benchmark named WriteEval for comprehensive evaluation. Results show that our method achieves consistent improvements in both instruction following (IFEval from 83.36% to 86.65%) and writing quality (72.75% win rate in manual expert pairwise evaluations on WriteEval). To the best of our knowledge, RLMR is the first work to combine subjective preferences with objective verification in online RL training, providing an effective solution for multi-dimensional creative writing optimization.
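The abstract does not give the weighting formula, so the following is only a minimal sketch of one way the stated property could be obtained, not the authors' implementation. It assumes a uniform penalty subtracted from the writing reward of every constraint-violating sample, and mean/standard-deviation group normalization as commonly used in GRPO. The function name, the penalty rule, and the `margin` and `eps` parameters are all hypothetical. Under these assumptions, the highest-scoring violator falls below the group mean whenever the penalty exceeds n * (max violator reward - mean reward) / (number of passing samples), so the penalty scales with the writing quality observed in the group.

```python
from statistics import fmean, pstdev


def mixed_group_advantages(writing_rewards, constraint_ok, margin=0.1, eps=1e-6):
    """Group-normalized advantages with a group-dependent constraint penalty.

    Hypothetical illustration only: the penalty rule below is an assumption,
    chosen so that every constraint-violating sample ends up below the group
    mean whenever at least one sample in the group satisfies the constraints.
    """
    n = len(writing_rewards)
    if n == 0 or n != len(constraint_ok):
        raise ValueError("inputs must be non-empty and of equal length")

    n_pass = sum(1 for ok in constraint_ok if ok)
    violator_rewards = [r for r, ok in zip(writing_rewards, constraint_ok) if not ok]

    if violator_rewards and n_pass > 0:
        # Smallest uniform penalty that pushes the best violator to the group
        # mean is n * gap / n_pass; the margin makes the inequality strict.
        gap = max(violator_rewards) - fmean(writing_rewards)
        penalty = max(0.0, n * gap / n_pass) + margin
    else:
        # All pass: nothing to penalize. All violate: a uniform shift cancels
        # in the normalization, and advantages sum to zero, so they cannot
        # all be negative anyway.
        penalty = 0.0

    mixed = [r if ok else r - penalty for r, ok in zip(writing_rewards, constraint_ok)]
    mean_m = fmean(mixed)
    std_m = pstdev(mixed)
    return [(m - mean_m) / (std_m + eps) for m in mixed]


if __name__ == "__main__":
    # One passing sample and four violators, one of which writes equally well.
    rewards = [0.5, 0.5, 0.0, 0.0, 0.0]
    passed = [True, False, False, False, False]
    for r, ok, a in zip(rewards, passed, mixed_group_advantages(rewards, passed)):
        print(f"reward={r:.2f} ok={ok} advantage={a:+.3f}")
```

A fixed penalty would not give the same guarantee: in the example group, a constant deduction smaller than 1.5 leaves the well-written violator above the group mean, so it would still be reinforced.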
