Abstract
With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source 1) the codebase (https://github.com/StanfordMIMI/MedVAL), 2) MedVAL-Bench (https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), and 3) MedVAL-4B (https://huggingface.co/stanfordmimi/MedVAL-4B), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.