Abstract
Feature matching is a cornerstone task in computer vision, essential for applications such as image retrieval, stereo matching, 3D reconstruction, and SLAM. This survey comprehensively reviews modality-based feature matching, exploring traditional handcrafted methods and emphasizing contemporary deep learning approaches across various modalities, including RGB images, depth images, 3D point clouds, LiDAR scans, medical images, and vision-language interactions. Traditional methods, leveraging detectors such as Harris corners and descriptors such as SIFT and ORB, demonstrate robustness under moderate intra-modality variations but struggle with significant modality gaps. Contemporary deep learning-based methods, exemplified by the CNN-based SuperPoint detector and descriptor and by detector-free strategies such as the transformer-based LoFTR, substantially improve robustness and adaptability across modalities. We highlight modality-aware advancements, such as geometric and depth-specific descriptors for depth images, sparse and dense learning methods for 3D point clouds, attention-enhanced neural networks for LiDAR scans, and specialized solutions like the MIND descriptor for complex medical image matching. Cross-modal applications, particularly in medical image registration and vision-language tasks, underscore the evolution of feature matching to handle increasingly diverse data interactions.