Abstract
Medical report generation from imaging data remains a challenging task in clinical practice. While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration. In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism. Our key innovation lies in generating instance-specific prompts tailored to individual medical images through conditional affine transformations derived from visual features. We propose two implementations: prompt-wise and promptbook-wise customization, enabling precise and targeted report generation. Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation. Our code will be made publicly available.
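
To make the idea of a conditional affine transformation concrete, the following is a minimal sketch under stated assumptions rather than the MRG-LLM implementation. It assumes that a scale vector and a shift vector are each predicted from a visual feature vector by a linear map, and that both are applied element-wise to every token of a shared base prompt. The function names, the linear parameterization, and the toy dimensions are all hypothetical; the abstract does not specify them, and the promptbook-wise variant is not covered.

```python
def affine_params(visual_feat, weight, bias):
    """Linear map from a visual feature vector to one parameter vector."""
    return [sum(w * x for w, x in zip(row, visual_feat)) + b
            for row, b in zip(weight, bias)]


def customize_prompt(base_prompt, visual_feat, w_scale, b_scale, w_shift, b_shift):
    """Apply an image-conditioned affine transform to each prompt token.

    base_prompt is a list of token embeddings of equal length d.
    The scale (gamma) and shift (beta) are predicted from visual_feat;
    this parameterization is an assumption made for illustration.
    """
    gamma = affine_params(visual_feat, w_scale, b_scale)
    beta = affine_params(visual_feat, w_shift, b_shift)
    return [[g * p + b for g, p, b in zip(gamma, token, beta)]
            for token in base_prompt]


# Toy example: two prompt tokens of dimension 2, visual feature of dimension 2.
base_prompt = [[1.0, 2.0], [3.0, 4.0]]
visual_feat = [1.0, 0.0]
w_scale, b_scale = [[0.5, 0.0], [0.0, 1.0]], [1.0, 1.0]   # gamma = [1.5, 1.0]
w_shift, b_shift = [[1.0, 0.0], [0.0, 0.0]], [0.0, 0.0]   # beta  = [1.0, 0.0]

custom = customize_prompt(base_prompt, visual_feat,
                          w_scale, b_scale, w_shift, b_shift)
print(custom)  # [[2.5, 2.0], [5.5, 4.0]]
```

Because the scale and shift depend on the image, two different images yield two different prompts from the same base prompt, which is the sense in which the prompts are instance-specific.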