CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Abstract

AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.
