Abstract
Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction in which the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent's success rate on the SWE-bench Verified benchmark from 20% for a rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B when using identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems based on open models.