Abstract
Medical image segmentation is a fundamental task in numerous medical engineering applications. Recently, language-guided segmentation has shown promise in medical scenarios where textual clinical reports are readily available as semantic guidance. Clinical reports contain diagnostic information provided by clinicians, which can provide auxiliary textual semantics to guide segmentation. However, existing language-guided segmentation methods neglect the inherent pattern gaps between image and text modalities, resulting in sub-optimal visual-language integration. Contrastive learning is a well-recognized approach to align image-text patterns, but it has not been optimized for bridging the pattern gaps in medical language-guided segmentation that relies primarily on medical image details to characterize the underlying disease/targets. Current contrastive alignment techniques typically align high-level global semantics without involving low-level localized target information, and thus cannot deliver fine-grained textual guidance on crucial image details. In this study, we propose a Target-informed Multi-level Contrastive Alignment framework (TMCA) to bridge image-text pattern gaps for medical language-guided segmentation. TMCA enables target-informed image-text alignments and fine-grained textual guidance by introducing: (i) a target-sensitive semantic distance module that utilizes target information for more granular image-text alignment modeling, (ii) a multi-level contrastive alignment strategy that directs fine-grained textual guidance to multi-scale image details, and (iii) a language-guided target enhancement module that reinforces attention to critical image regions based on the aligned image-text patterns. Extensive experiments on four public benchmark datasets demonstrate that TMCA enabled superior performance over state-of-the-art language-guided medical image segmentation methods.