Abstract
Referring expression comprehension (REC) aims to localize the region of a given image that a natural-language referring expression describes. Existing methods focus on building convincing visual and language representations independently, which may largely isolate visual information from language information. In this paper, we argue that, for REC, the referring expression and the target region are semantically correlated, and that subject, location, and relationship consistency exists between vision and language. Building on this, we propose a novel approach called MutAtt that constructs mutual guidance between vision and language, treating the two modalities equally and thus yielding compact information matching. Specifically, for each of the subject, location, and relationship modules, MutAtt builds two kinds of attention-based mutual guidance strategies. One strategy generates a vision-guided language embedding in order to match the relevant visual feature. The other, conversely, generates a language-guided visual feature to match the relevant language embedding. This mutual guidance strategy can effectively guarantee vision-language consistency in all three modules. Experiments on three popular REC datasets demonstrate that the proposed approach outperforms current state-of-the-art methods.
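To make the two guidance directions concrete, the following is a minimal sketch of attention-based mutual guidance for a single module. It is an illustration under stated assumptions, not the MutAtt implementation: the dot-product attention, mean pooling, cosine matching, and equal weighting of the two directions are all assumed here, and the names `attend`, `mean_vector`, and `mutual_match_score` are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def attend(query, items):
    """Pool `items` with softmax(dot(query, item)) weights (assumed attention form)."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    pooled = [sum(w * item[k] for w, item in zip(weights, items)) for k in range(dim)]
    return pooled, weights


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def mutual_match_score(word_embs, grid_feats):
    """Hypothetical matching score for one module (e.g. the subject module)."""
    vis_summary = mean_vector(grid_feats)   # assumed visual summary of the candidate region
    lang_summary = mean_vector(word_embs)   # assumed summary of the expression

    # Direction 1: vision guides language, then the result is matched to the visual feature.
    vision_guided_lang, _ = attend(vis_summary, word_embs)
    # Direction 2: language guides vision, then the result is matched to the language embedding.
    lang_guided_vis, _ = attend(lang_summary, grid_feats)

    # Equal weighting of the two directions is an assumption of this sketch.
    return 0.5 * (cosine(vision_guided_lang, vis_summary)
                  + cosine(lang_guided_vis, lang_summary))


if __name__ == "__main__":
    words = [[1.0, 0.0], [0.0, 1.0]]   # toy word embeddings
    region = [[1.0, 0.0], [1.0, 0.0]]  # toy grid features of one candidate region
    print(mutual_match_score(words, region))
```

In a full system, one such score would be computed per candidate region and per module (subject, location, relationship), and the region with the highest combined score would be selected; how the module scores are combined is not specified here.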