Abstract
Medical ultrasonography is an essential imaging technique for examining superficial organs and tissues, including lymph nodes, breast, and thyroid. It employs high-frequency ultrasound waves to generate detailed images of the internal structures of the human body. However, manually contouring regions of interest in these images is a labor-intensive task that demands expertise and often results in inconsistent interpretations among individuals. Vision-language foundation models, which have excelled in various computer vision applications, present new opportunities for enhancing ultrasound image analysis. Yet their performance is hindered by the significant differences between the natural and medical imaging domains. This research seeks to overcome these challenges by developing domain adaptation methods for vision-language foundation models. In this study, we explore a fine-tuning pipeline for vision-language foundation models that uses a large language model as a text refiner, together with specially designed adaptation strategies and task-driven heads. Our approach has been extensively evaluated on six ultrasound datasets and two tasks: segmentation and classification. The experimental results show that our method effectively improves the performance of vision-language foundation models for ultrasound image analysis and outperforms existing state-of-the-art vision-language and pure foundation models. The source code of this study is available at \href{https://github.com/jinggqu/NextGen-UIA}{GitHub}.