Abstract
Medical data poses a daunting challenge for AI algorithms: it exists in many different modalities, experiences frequent distribution shifts, and suffers from a scarcity of examples and labels. Recent advances, including transformers and self-supervised learning, promise a more universal approach that can be applied flexibly across these diverse conditions. To measure and drive progress in this direction, we present BenchMD: a benchmark that tests how modality-agnostic methods, including architectures and training techniques (e.g. self-supervised learning, ImageNet pretraining), perform on a diverse array of clinically-relevant medical tasks. BenchMD combines 19 publicly available datasets for 7 medical modalities, including 1D sensor data, 2D images, and 3D volumetric scans. Our benchmark reflects real-world data constraints by evaluating methods across a range of dataset sizes, including challenging few-shot settings that incentivize the use of pretraining. Finally, we evaluate performance on out-of-distribution data collected at different hospitals than the training data, representing naturally-occurring distribution shifts that frequently degrade the performance of medical AI models. Our baseline results demonstrate that no modality-agnostic technique achieves strong performance across all modalities, leaving ample room for improvement on the benchmark. Code is released at https://github.com/rajpurkarlab/BenchMD.