Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding

Abstract

Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly applied to image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we find that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images. (2) The ICU performance of VLFMs is significantly affected by viewpoint variations, because the relationships between objects are altered when the viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic space, respectively. Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance in all metrics.
