Abstract
Bridging the gap between ego-centric and exo-centric views has been a long-standing question in computer vision. In this paper, we focus on the emerging Ego-Exo object correspondence task, which aims to understand object relations across ego-exo perspectives through segmentation. While numerous segmentation models have been proposed, most operate on a single image (view), making them impractical for cross-view scenarios. PSALM, a recently proposed segmentation method, stands out as a notable exception with its demonstrated zero-shot ability on this task. However, due to the drastic viewpoint change between ego and exo, PSALM fails to accurately locate and segment objects, especially in complex backgrounds or when object appearances change significantly. To address these issues, we propose ObjectRelator, a novel approach featuring two key modules: Multimodal Condition Fusion (MCFuse) and SSL-based Cross-View Object Alignment (XObjAlign). MCFuse introduces language as an additional cue, integrating both visual masks and textual descriptions to improve object localization and prevent incorrect associations. XObjAlign enforces cross-view consistency through self-supervised alignment, enhancing robustness to object appearance variations. Extensive experiments demonstrate ObjectRelator's effectiveness on the large-scale Ego-Exo4D benchmark and HANDAL-X (an adapted dataset for cross-view segmentation) with state-of-the-art performance. Code is made available at: http://yuqianfu.com/ObjectRelator.