SWE-smith: Scaling Data for Software Engineering Agents

Abstract

Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets available at https://swesmith.com.
