GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Abstract

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.
