PaperBench: Evaluating AI's Ability to Replicate AI Research

Abstract

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.
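
To make the idea of a hierarchical rubric concrete, the following is a minimal sketch of how a replication score could be aggregated from individually graded leaf tasks. The abstract does not specify the aggregation rule, so the weighted-average scheme, the `RubricNode` class, the `score` function, and the example requirements and weights are all illustrative assumptions, not the benchmark's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree (all names are illustrative)."""
    requirement: str
    weight: float = 1.0                 # assumed: relative importance among siblings
    passed: Optional[bool] = None       # set only on leaves, e.g. by a judge
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Return a score in [0, 1]: leaves are pass/fail, parents are the
    weight-normalized average of their children (an assumed rule)."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    if total_weight == 0:
        return 0.0
    return sum(child.weight * score(child) for child in node.children) / total_weight


# A toy rubric for one paper; the requirements and weights are made up.
root = RubricNode(
    requirement="Replicate the paper",
    children=[
        RubricNode(
            requirement="Develop the codebase",
            weight=2.0,
            children=[
                RubricNode("Implement the training loop", passed=True),
                RubricNode("Implement the evaluation metric", passed=False),
            ],
        ),
        RubricNode("Execute the experiments", weight=1.0, passed=False),
        RubricNode("Match the reported results", weight=1.0, passed=True),
    ],
)

replication_score = score(root)
print(f"Replication score: {replication_score:.1%}")
```

In this sketch only the leaves need a judgment, which is what makes each sub-task individually gradable; the score of every internal node follows mechanically from the grades below it.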
