TRUST: Leveraging Text Robustness for Unsupervised Domain Adaptation

Abstract

Recent unsupervised domain adaptation (UDA) methods have shown great success in addressing classical domain shifts (e.g., synthetic-to-real), but they still suffer under complex shifts (e.g., geographical shift), where both the background and object appearances differ significantly across domains. Prior work has shown that the language modality can help in the adaptation process, as it is more robust to such complex shifts. In this paper, we introduce TRUST, a novel UDA approach that exploits the robustness of the language modality to guide the adaptation of a vision model. TRUST generates pseudo-labels for target samples from their captions and introduces a novel uncertainty estimation strategy that uses normalised CLIP similarity scores to estimate the uncertainty of the generated pseudo-labels. The estimated uncertainty is then used to reweight the classification loss, mitigating the adverse effects of wrong pseudo-labels obtained from low-quality captions. To further increase the robustness of the vision model, we propose a multimodal soft-contrastive learning loss that aligns the vision and language feature spaces by leveraging captions to guide the contrastive training of the vision model on target images. In our contrastive loss, each pair of images acts as both a positive and a negative pair, and their feature representations are attracted and repulsed with a strength proportional to the similarity of their captions. This solution avoids the need for a hard assignment of positive and negative pairs, which is problematic in the UDA setting. Our approach outperforms previous methods, setting a new state of the art on classical (DomainNet) and complex (GeoNet) domain shifts. The code will be available upon acceptance.
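
To make the two loss terms described above concrete, the following is a minimal illustrative sketch and not the released implementation. Every function name and the exact functional forms are assumptions: the pseudo-label weight is taken to be the softmax-normalised CLIP similarity of the pseudo-label class, and the soft-contrastive term is written as a binary cross-entropy between a sigmoid of the image-feature cosine similarity and a caption similarity in [0, 1]. The abstract specifies neither form.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label_weight(clip_scores, pseudo_label, temperature=1.0):
    """Confidence in [0, 1] for a caption-derived pseudo-label.

    ASSUMPTION: confidence is the softmax-normalised CLIP similarity
    assigned to the pseudo-label class; low values mean high uncertainty.
    """
    return softmax(clip_scores, temperature)[pseudo_label]


def weighted_cross_entropy(logits, pseudo_label, weight):
    """Cross-entropy on the pseudo-label, scaled by its confidence weight."""
    probs = softmax(logits)
    return -weight * math.log(probs[pseudo_label])


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_contrastive_loss(image_feats, caption_sims, temperature=0.1):
    """Soft-contrastive loss averaged over all unordered image pairs.

    ASSUMPTION: each pair (i, j) contributes an attraction term weighted by
    the caption similarity s_ij and a repulsion term weighted by 1 - s_ij,
    so no pair is ever hard-assigned as positive or negative.
    """
    n = len(image_feats)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(image_feats[i], image_feats[j]) / temperature
            p = 1.0 / (1.0 + math.exp(-sim))
            p = min(max(p, 1e-7), 1.0 - 1e-7)  # keep log() finite
            s = caption_sims[i][j]
            total += -(s * math.log(p) + (1.0 - s) * math.log(1.0 - p))
            pairs += 1
    return total / pairs if pairs else 0.0
```

Under these assumptions, a pseudo-label whose CLIP scores are nearly uniform across classes receives a small weight and contributes little to the classification loss, while two images with similar captions are mostly attracted and only weakly repulsed.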
