Abstract
Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset. Our system, named CodeMonkeys, allows models to iteratively edit a codebase by jointly generating and running a testing script alongside their draft edit. We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits. This approach lets us scale "serial" test-time compute by increasing the number of iterations per trajectory and "parallel" test-time compute by increasing the number of trajectories per problem. With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to identify relevant codebase context using the simple method of letting an LLM read every file. In order to select between candidate edits, we combine voting using model-generated tests with a final multi-turn trajectory dedicated to selection. Overall, CodeMonkeys resolves 57.4% of issues from SWE-bench Verified using a budget of approximately 2300 USD. Our selection method can also be used to combine candidates from different sources. Selecting over an ensemble of edits from existing top SWE-bench Verified submissions obtains a score of 66.2% and outperforms the best member of the ensemble on its own. We fully release our code and data at https://scalingintelligence.stanford.edu/pubs/codemonkeys.
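As a rough illustration of the test-based voting step, the sketch below ranks candidate edits by how many model-generated tests they pass and keeps the top few for a later selection stage. This is only one plausible reading of the abstract, not the released implementation: the function name `vote_with_tests`, the `keep` parameter, the tie-breaking rule, and the modeling of a test as a callable returning a boolean are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def vote_with_tests(
    candidates: Sequence[str],
    tests: Sequence[Callable[[str], bool]],
    keep: int = 3,
) -> list[str]:
    """Rank candidate edits by tests passed and return the top `keep`.

    Hypothetical helper: each test is modeled as a callable that takes a
    candidate edit and returns True if the edit passes.
    """
    scored = []
    for index, candidate in enumerate(candidates):
        # One vote per test that the candidate passes.
        passed = sum(1 for test in tests if test(candidate))
        scored.append((passed, index, candidate))
    # Most votes first; ties keep the original sampling order (an assumption).
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [candidate for _, _, candidate in scored[:keep]]
```

In this sketch the surviving candidates would then be handed to the final multi-turn selection trajectory, which is not modeled here.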