Abstract
We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$Code, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$Code iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$Code at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.
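To make the interaction between generator, verifier, and execution feedback concrete, the following is a minimal toy sketch of a multi-turn loop. It is not the released implementation: the names `run_tests`, `multi_turn_generate`, `toy_generator`, and `toy_verifier` are hypothetical, the generator and verifier are hard-coded stand-ins for learned models, and using the verifier to pick among sampled candidates at each turn is an assumption made for illustration. Training of either model is not shown.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (code, execution feedback) per turn


def run_tests(code: str, tests: List[Tuple[int, int]]) -> Tuple[bool, str]:
    """Execute candidate code that should define f(x); return (passed, feedback)."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        for x, expected in tests:
            got = namespace["f"](x)
            if got != expected:
                return False, f"f({x}) returned {got}, expected {expected}"
    except Exception as exc:  # syntax errors, missing f, runtime errors
        return False, f"error: {exc!r}"
    return True, "all tests passed"


def multi_turn_generate(
    generator: Callable[[str, History, int], List[str]],
    verifier: Callable[[str, str], float],
    prompt: str,
    tests: List[Tuple[int, int]],
    max_turns: int = 3,
    n_samples: int = 2,
) -> Tuple[str, History]:
    """Each turn: sample candidates, keep the verifier's top pick, execute it,
    and append the feedback to the history the generator conditions on."""
    history: History = []
    best = ""
    for _ in range(max_turns):
        candidates = generator(prompt, history, n_samples)
        best = max(candidates, key=lambda c: verifier(prompt, c))
        passed, feedback = run_tests(best, tests)
        history.append((best, feedback))
        if passed:
            break
    return best, history


# Hypothetical stand-ins for learned models (assumptions, for illustration only).
def toy_generator(prompt: str, history: History, n: int) -> List[str]:
    if not history:  # first turn: no feedback yet
        return ["def f(x):\n    return x + 2", "def f(x):\n    return x + 1"][:n]
    # later turns: pretend the feedback led to revised candidates
    return ["def f(x):\n    return x * 2", "def f(x):\n    return x - 2"][:n]


def toy_verifier(prompt: str, code: str) -> float:
    return 1.0 if "x * 2" in code else 0.0


final_code, trace = multi_turn_generate(
    toy_generator, toy_verifier, "Write f(x) that doubles x.", [(1, 2), (3, 6)]
)
```

In this toy run the first turn fails its tests, the feedback is appended to the history, and the second turn's top-scored candidate passes, so the loop stops after two turns.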