Abstract
We introduce XoFTR, a cross-modal cross-view method for local feature matching between thermal infrared (TIR) and visible images. Unlike visible images, TIR images are less susceptible to adverse lighting and weather conditions but present difficulties in matching due to significant texture and intensity differences. Current hand-crafted and learning-based methods for visible-TIR matching fall short in handling viewpoint, scale, and texture diversities. To address this, XoFTR incorporates masked image modeling pre-training and fine-tuning with pseudo-thermal image augmentation to handle the modality differences. Additionally, we introduce a refined matching pipeline that adjusts for scale discrepancies and enhances match reliability through sub-pixel level refinement. To validate our approach, we collect a comprehensive visible-thermal dataset and show that our method outperforms existing methods on many benchmarks.