Abstract
Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With this pipeline, we curated 461 AI research tasks from 51 top-tier AI research papers for EXP-Bench. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments is a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve their ability to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.