SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

Abstract

The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
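To make the "lightweight rule-based reward" concrete, here is a minimal sketch of a similarity score between a ground-truth patch and a generated one. The abstract does not specify the metric, so using Python's `difflib.SequenceMatcher` is an assumption made for illustration, and the function name `similarity_reward` and the sample patches are hypothetical.

```python
from difflib import SequenceMatcher


def similarity_reward(predicted_patch: str, reference_patch: str) -> float:
    """Return a similarity score in [0, 1] between two patch strings.

    Assumption: character-level sequence matching stands in for the
    abstract's "similarity score"; the paper's exact metric may differ.
    """
    matcher = SequenceMatcher(None, predicted_patch, reference_patch, autojunk=False)
    return matcher.ratio()


# Hypothetical ground-truth patch and two candidate generations.
reference = "-    return a - b\n+    return a + b\n"
close_guess = "-    return a - b\n+    return b + a\n"
unrelated = "zzz"

for name, candidate in [("exact", reference), ("close", close_guess), ("unrelated", unrelated)]:
    print(name, round(similarity_reward(candidate, reference), 3))
```

A reward of this kind needs no test execution or learned reward model: an exact match scores 1.0, and partially correct patches receive partial credit, which is what makes it cheap to compute over large amounts of software evolution data.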
