Process-Supervised Reinforcement Learning for Code Generation

Abstract

Existing reinforcement learning strategies based on outcome supervision have proven effective in enhancing the performance of large language models (LLMs) for code generation. While reinforcement learning based on process supervision has shown great promise in handling multi-step reasoning tasks, its effectiveness in code generation remains largely underexplored and underjustified. The primary obstacle stems from the resource-intensive nature of constructing high-quality process-supervised data, which demands substantial human expertise and computational resources. In response to this challenge, we propose a "statement mutation/refactoring-compile and execution verification" strategy: mutating and refactoring code line-by-line through a teacher model, and utilizing compiler execution results to automatically label each line, resulting in line-by-line process-supervised data, which is pivotal for training a process-supervised reward model. The trained reward model is then integrated into the PRLCoder framework, followed by experimental validation on several benchmarks. Experimental results demonstrate that process-supervised reinforcement learning significantly surpasses methods relying solely on outcome supervision. Notably, in tackling complex code generation tasks, process-supervised reinforcement learning shows a clear advantage, ensuring both the integrity of the code generation process and the correctness of the generation results.
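The labeling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the helper names (`passes_tests`, `label_lines`, `toy_rewriter`) are hypothetical, the teacher model is replaced by a hand-written stub that proposes rewrites for a single line, and "compile and execution verification" is approximated by running each variant against a few test cases with Python's built-in `compile` and `exec`.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (positional arguments, expected return value)


def passes_tests(source: str, func_name: str, tests: List[TestCase]) -> bool:
    """Return True if `source` compiles, runs, and passes every test case."""
    namespace: dict = {}
    try:
        # Compile errors (SyntaxError) and runtime errors both count as failure.
        exec(compile(source, "<candidate>", "exec"), namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False


def label_lines(
    reference: str,
    rewrite_line: Callable[[str], List[str]],
    func_name: str,
    tests: List[TestCase],
) -> List[Tuple[int, str, int]]:
    """Label each rewritten line 1 (still correct) or 0 (breaks the program).

    `rewrite_line` stands in for the teacher model (an assumption of this
    sketch): given one line, it returns candidate mutations or refactorings.
    """
    lines = reference.splitlines()
    labeled = []
    for i, original in enumerate(lines):
        for candidate in rewrite_line(original):
            # Substitute exactly one line and verify the whole program.
            variant = lines[:i] + [candidate] + lines[i + 1:]
            ok = passes_tests("\n".join(variant), func_name, tests)
            labeled.append((i, candidate, 1 if ok else 0))
    return labeled


def toy_rewriter(line: str) -> List[str]:
    """Hand-written stand-in for a teacher model: rewrites one known line."""
    if line.strip() == "return a + b":
        return ["    return b + a", "    return a - b"]
    return []


reference = "def add(a, b):\n    return a + b"
tests = [((1, 2), 3), ((0, 5), 5)]
labels = label_lines(reference, toy_rewriter, "add", tests)
for line_index, candidate, label in labels:
    print(line_index, repr(candidate), label)
```

In this toy run the refactored line `return b + a` keeps the tests passing and is labeled 1, while the mutated line `return a - b` fails and is labeled 0. A practical version would need sandboxed execution with timeouts, since `exec` runs candidate code in the current process and does not guard against infinite loops.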
