Viewpoint Invariant Change Captioning

Abstract

The ability to detect that something has changed in an environment is valuable, but often only if it can be accurately conveyed to a human operator. We introduce Viewpoint Invariant Change Captioning, and develop models which can both localize complex changes in an environment and describe them in natural language. Moreover, we distinguish between a change in viewpoint and an actual scene change (e.g. a change of objects' attributes). To study this new problem, we collect a Viewpoint Invariant Change Captioning Dataset (VICC), building it off the CLEVR dataset and engine. We introduce 5 types of scene changes, including changes in attributes, positions, etc. To tackle this problem, we propose an approach that distinguishes a viewpoint change from an important scene change, localizes the change between "before" and "after" images, and dynamically attends to the relevant visual features when describing the change. We benchmark a number of baselines on our new dataset, and systematically study the different change types. We show the superiority of our proposed approach in terms of change captioning and localization. Finally, we also show that our approach is general and can be applied to real images and language on the recent Spot-the-diff dataset.
