LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Abstract

There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and reasoning on textbook-style science questions, but few if any benchmarks are designed to evaluate language model performance on practical tasks required for scientific research, such as literature search, protocol planning, and data analysis. As a step toward building such benchmarks, we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences. Importantly, in contrast to previous scientific benchmarks, we expect that an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning. As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure performance of several against our benchmark and report results compared to human expert biology researchers. We will continue to update and expand LAB-Bench over time, and expect it to serve as a useful tool in the development of automated research systems going forward. A public subset of LAB-Bench is available for use at the following URL: https://huggingface.co/datasets/futurehouse/lab-bench
