Abstract
Current medical artificial intelligence systems are often limited to narrow applications, hindering their widespread adoption in clinical practice. To address this limitation, we propose MedVersa, a generalist learner that enables flexible learning and tasking for medical image interpretation. By leveraging a large language model as a learnable orchestrator, MedVersa can learn from both visual and linguistic supervision, support multimodal inputs, and perform real-time task specification. This versatility allows MedVersa to adapt to various clinical scenarios and perform multifaceted medical image analysis. We introduce MedInterp, the largest multimodal dataset to date for medical image interpretation, consisting of over 13 million annotated instances spanning 11 tasks across 3 modalities, to support the development of MedVersa. Our experiments demonstrate that MedVersa achieves state-of-the-art performance in 9 tasks, sometimes outperforming specialist counterparts by over 10%. MedVersa is the first to showcase the viability of multimodal generative medical AI in implementing multimodal outputs, inputs, and dynamic task specification, highlighting its potential as a multifunctional system for comprehensive medical image analysis. This generalist approach to medical image interpretation paves the way for more adaptable and efficient AI-assisted clinical decision-making.