Abstract
Rule-based rewards offer a promising strategy for improving reinforcement learning from human feedback (RLHF), but current approaches often rely on manual rule engineering. We present AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. AutoRule extraction operates in three stages: it leverages a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chain of these interpretations, and synthesizes them into a unified rule set. Leveraging the finalized rule set, we employ language-model verifiers to compute the fraction of rules satisfied by each output, using this metric as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule results in a 28.6\% relative improvement in length-controlled win rate on AlpacaEval2.0, and a 6.1\% relative gain in second-turn performance on a held-out MT-Bench subset, compared to a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. Our analysis confirms that the extracted rules exhibit good agreement with dataset preference. We find that AutoRule demonstrates reduced reward hacking compared to a learned reward model when run over two episodes. Finally, our case study suggests that the extracted rules capture unique qualities valued in different datasets. The extracted rules are provided in the appendix, and the code is open-sourced at https://github.com/cxcscmu/AutoRule.
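To illustrate the auxiliary reward described above, the following is a minimal Python sketch of computing the fraction of rules satisfied by an output and adding it to a learned reward. It is not the released implementation. The names `rule_satisfaction_fraction`, `combined_reward`, `toy_verifier`, and the `rule_weight` parameter are hypothetical, the weighted-sum combination is an assumption, and the keyword-based verifier merely stands in for a language-model verifier.

```python
from typing import Callable, Sequence

# Hypothetical verifier signature: (rule, response) -> True if the rule is satisfied.
# In the setting described above this would be a language-model judge.
Verifier = Callable[[str, str], bool]


def rule_satisfaction_fraction(
    response: str, rules: Sequence[str], verifier: Verifier
) -> float:
    """Return the fraction of rules the verifier judges as satisfied."""
    if not rules:
        return 0.0  # assumption: an empty rule set contributes no reward
    satisfied = sum(1 for rule in rules if verifier(rule, response))
    return satisfied / len(rules)


def combined_reward(
    learned_reward: float, rule_fraction: float, rule_weight: float = 1.0
) -> float:
    """Assumed combination: learned reward plus a weighted rule-based term."""
    return learned_reward + rule_weight * rule_fraction


def toy_verifier(rule: str, response: str) -> bool:
    """Keyword-based stand-in for a language-model verifier (illustration only)."""
    if rule == "The response is concise.":
        return len(response.split()) <= 20
    if rule == "The response includes an example.":
        return "for example" in response.lower()
    return False  # unknown rules are treated as not satisfied


# Illustrative rules, not taken from the extracted rule set.
rules = [
    "The response is concise.",
    "The response includes an example.",
    "The response cites a source.",
]
response = "Use a list comprehension. For example, [x * 2 for x in xs]."

fraction = rule_satisfaction_fraction(response, rules, toy_verifier)
total = combined_reward(learned_reward=0.5, rule_fraction=fraction, rule_weight=0.5)
print(f"rule fraction = {fraction:.3f}, combined reward = {total:.3f}")
```

In this sketch the rule-based term is bounded in [0, 1], so `rule_weight` controls how strongly it can shift the learned reward; how the two signals are actually scaled and combined during policy optimization is a design choice not specified in the abstract.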