Abstract
In recent years, deep learning models comprising transformer components have pushed the performance envelope in medical image synthesis tasks. Unlike convolutional neural networks (CNNs), which use static, local filters, transformers use self-attention mechanisms to permit adaptive, non-local filtering that sensitively captures long-range context. However, this sensitivity comes at the expense of substantial model complexity, which can compromise learning efficacy, particularly on relatively modest-sized imaging datasets. Here, we propose a novel adversarial model for multi-modal medical image synthesis, I2I-Mamba, that leverages selective state space modeling (SSM) to efficiently capture long-range context while maintaining local precision. To do this, I2I-Mamba injects channel-mixed Mamba (cmMamba) blocks in the bottleneck of a convolutional backbone. In cmMamba blocks, SSM layers are used to learn context across the spatial dimension, and channel-mixing layers are used to learn context across the channel dimension of feature maps. Comprehensive demonstrations are reported for imputing missing images in multi-contrast MRI and MRI-CT protocols. Our results indicate that I2I-Mamba outperforms state-of-the-art CNN- and transformer-based methods in synthesizing target-modality images.
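To make the division of labor inside a cmMamba block concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single-direction, row-major scan over the flattened spatial grid, a softplus step size, a ReLU-activated linear channel mixer, and a residual connection. The names `selective_scan`, `cm_block`, `init_params` and all parameter shapes are hypothetical, chosen only for illustration.

```python
import numpy as np

def softplus(v):
    # Numerically stable softplus: log(1 + exp(v)).
    return np.logaddexp(0.0, v)

def selective_scan(x, A, W_B, W_C, W_dt):
    """Toy selective SSM over a sequence x of shape (L, D).

    A    : (D, N) negative decay parameters (assumed diagonal state matrix)
    W_B  : (D, N) projection giving an input-dependent B_t
    W_C  : (D, N) projection giving an input-dependent C_t
    W_dt : (D, D) projection giving input-dependent step sizes
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                     # hidden state per channel
    y = np.empty((L, D))
    for t in range(L):
        dt = softplus(x[t] @ W_dt)           # (D,) positive step sizes
        B_t = x[t] @ W_B                     # (N,)
        C_t = x[t] @ W_C                     # (N,)
        A_bar = np.exp(dt[:, None] * A)      # (D, N), values in (0, 1)
        h = A_bar * h + (dt[:, None] * B_t[None, :]) * x[t][:, None]
        y[t] = h @ C_t                       # (D,) readout
    return y

def cm_block(feat, p):
    """Spatial context via the SSM scan, then channel context via mixing."""
    C, H, W = feat.shape
    seq = feat.reshape(C, H * W).T           # (L, C), row-major pixel order
    ctx = selective_scan(seq, p["A"], p["W_B"], p["W_C"], p["W_dt"])
    mixed = np.maximum(ctx @ p["W_mix"], 0.0)  # per-pixel mixing across channels
    out = seq + mixed                        # residual connection (assumption)
    return out.T.reshape(C, H, W)

def init_params(C, N, seed=0):
    # Hypothetical random initialization, for demonstration only.
    rng = np.random.default_rng(seed)
    s = 0.1
    return {
        "A": -np.exp(rng.normal(size=(C, N))),
        "W_B": s * rng.normal(size=(C, N)),
        "W_C": s * rng.normal(size=(C, N)),
        "W_dt": s * rng.normal(size=(C, C)),
        "W_mix": s * rng.normal(size=(C, C)),
    }

if __name__ == "__main__":
    params = init_params(C=4, N=3)
    feature_map = np.random.default_rng(1).normal(size=(4, 5, 6))
    print(cm_block(feature_map, params).shape)
```

In this sketch the scan costs time linear in the number of spatial positions, whereas full self-attention scales quadratically with it. The channel mixer acts independently at each position, so the two steps handle spatial and channel context separately, as the abstract describes.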