Abstract
Current image fusion methods struggle to adapt to real-world environments encompassing diverse degradations with spatially varying characteristics. To address this challenge, we propose a robust fusion controller (RFC) capable of achieving degradation-aware image fusion through fine-grained language instructions, ensuring its reliable application in adverse environments. Specifically, RFC first parses language instructions to derive a functional condition and a spatial condition: the former specifies the degradation type to remove, while the latter defines its spatial coverage. Then, a composite control priori is generated through a multi-condition coupling network, achieving a seamless transition from abstract language instructions to latent control variables. Subsequently, we design a hybrid attention-based fusion network to aggregate multi-modal information, in which the obtained composite control priori is deeply embedded to linearly modulate the intermediate fused features. To ensure alignment between language instructions and control outcomes, we introduce a novel language-feature alignment loss, which constrains the consistency between feature-level gains and the composite control priori. Extensive experiments on publicly available datasets demonstrate that our RFC is robust against various composite degradations, particularly in highly challenging flare scenarios.
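
As an illustration of what linearly modulating the intermediate fused features can mean, the following minimal sketch applies an element-wise affine transform, out = scale * features + shift. It is not the RFC implementation: the function name `linear_modulate`, the assumption that the composite control priori is projected into scale and shift maps of the same shape as the features, and the toy values are all hypothetical.

```python
from typing import List

# channels x height x width, stored as nested lists for a dependency-free sketch
Tensor3 = List[List[List[float]]]


def linear_modulate(features: Tensor3, scale: Tensor3, shift: Tensor3) -> Tensor3:
    """Element-wise affine modulation: out = scale * features + shift.

    `scale` and `shift` are assumed (hypothetically) to be maps derived from
    the control prior, with the same shape as `features`.
    """
    return [
        [
            [s * f + b for f, s, b in zip(f_row, s_row, b_row)]
            for f_row, s_row, b_row in zip(f_ch, s_ch, b_ch)
        ]
        for f_ch, s_ch, b_ch in zip(features, scale, shift)
    ]


# Toy example: one channel, 2x2 fused feature map.
fused = [[[1.0, 2.0], [3.0, 4.0]]]
# Hypothetical maps: the left column is modulated, the right column is left
# unchanged (scale = 1, shift = 0), mimicking a spatially restricted control.
scale = [[[2.0, 1.0], [2.0, 1.0]]]
shift = [[[0.5, 0.0], [0.5, 0.0]]]

modulated = linear_modulate(fused, scale, shift)
print(modulated)
```

In this sketch, positions where the scale is 1 and the shift is 0 pass through untouched, which is one simple way a spatial condition could confine the effect of a control signal to a chosen region.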