Abstract
Multispectral object detection, which integrates information from multiple bands, can improve detection accuracy and environmental adaptability, and holds great application potential across various fields. Although existing methods have made progress in cross-modal interaction, low-light conditions, and lightweight model design, challenges remain, including the lack of a unified single-stage framework, the difficulty of balancing performance and fusion strategy, and unreasonable modality weight allocation. To address these challenges, we present YOLOv11-RGBT, a new comprehensive multimodal object detection framework based on YOLOv11. We designed six multispectral fusion modes and successfully applied them to models from YOLOv3 to YOLOv12 and RT-DETR. After reevaluating the importance of the two modalities, we proposed a P3 mid-fusion strategy and a multispectral controllable fine-tuning (MCF) strategy for multispectral models. These improvements optimize feature fusion, reduce redundancy and mismatches, and boost overall model performance. Experiments show that our framework excels on three major open-source multispectral object detection datasets, including LLVIP and FLIR. In particular, the MCF strategy significantly enhanced model adaptability and robustness: on the FLIR dataset, it consistently improved the mAP of YOLOv11 models by 3.41%-5.65%, reaching a maximum of 47.61%, which verifies the effectiveness of the framework and strategies. The code is available at: https://github.com/wandahangFY/YOLOv11-RGBT.