Abstract
Multi-Modal Object Detection (MMOD) has been widely adopted across applications because it adapts better to complex environments. Extensive research is dedicated to RGB-IR object detection, primarily focusing on how to integrate complementary features from the RGB and IR modalities. However, these works neglect the problem of insufficient mono-modality learning, which arises from decreased feature extraction capability in multi-modal joint learning. This leads to a prevalent but unreasonable phenomenon, Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, we introduce linear probing evaluation to multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Accordingly, we construct a novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates sufficient mono-modality learning during multi-modal joint training and explores a lightweight yet effective feature fusion scheme to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms previous SOTA detectors. The code is available at https://github.com/Zhao-Tian-yi/M2D-LIF.