HealthBench: Evaluating Large Language Models Towards Improved Human Health

Abstract

We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo's 16% to GPT-4o's 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.
