Abstract
Medical image and video segmentation is a critical task for precision medicine, which has witnessed considerable progress in developing task or modality-specific and generalist models for 2D images. However, there have been limited studies on building general-purpose models for 3D images and videos with comprehensive user studies. Here, we present MedSAM2, a promptable segmentation foundation model for 3D image and video segmentation. The model is developed by fine-tuning the Segment Anything Model 2 on a large medical dataset with over 455,000 3D image-mask pairs and 76,000 frames, outperforming previous models across a wide range of organs, lesions, and imaging modalities. Furthermore, we implement a human-in-the-loop pipeline to facilitate the creation of large-scale datasets, resulting in, to the best of our knowledge, the most extensive user study to date, involving the annotation of 5,000 CT lesions, 3,984 liver MRI lesions, and 251,550 echocardiogram video frames, demonstrating that MedSAM2 can reduce manual costs by more than 85%. MedSAM2 is also integrated into widely used platforms with user-friendly interfaces for local and cloud deployment, making it a practical tool for supporting efficient, scalable, and high-quality segmentation in both research and healthcare environments.