Abstract
Cross-modal alignment aims to map heterogeneous modalities into a shared latent space, as exemplified by models like CLIP, which benefit from large-scale image-text pretraining for strong recognition capabilities. However, when operating in resource-constrained settings with limited or low-quality data, these models often suffer from overconfidence and degraded performance due to the prevalence of ambiguous or weakly correlated image-text pairs. Current contrastive learning approaches, which rely on single positive pairs, further exacerbate this issue by reinforcing overconfidence on uncertain samples. To address these challenges, we propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency. Our approach leverages two complementary strategies: Random Perturbation, which introduces controlled noise to simulate uncertainty, and Embedding Smoothing, which calibrates similarity distributions in the embedding space. These mechanisms collectively reduce overconfidence and improve performance on noisy or weakly aligned samples. Extensive experiments across multiple benchmark datasets demonstrate that Modest-Align outperforms state-of-the-art methods in retrieval tasks, achieving competitive results with over 100x less training data and 600x less GPU time than CLIP. Our method offers a practical and scalable solution for cross-modal alignment in real-world, low-resource scenarios.