Attentive Relational Networks for Mapping Images to Scene Graphs

Abstract

Scene graph generation refers to the task of automatically mapping an imageinto a semantic structural graph, which requires correctly labeling eachextracted objects and their interaction relationships. Despite the recentsuccesses in object detection using deep learning techniques, inferring complexcontextual relationships and structured graph representations from visual dataremains a challenging topic. In this study, we propose a novel AttentiveRelational Network that consists of two key modules with an object detectionbackbone to approach this problem. The first module is a semantictransformation module used to capture semantic embedded relation features, bytranslating visual features and linguistic features into a common semanticspace. The other module is a graph self-attention module introduced to embed ajoint graph representation through assigning various importance weights toneighboring nodes. Finally, accurate scene graphs are produced with therelation inference module by recognizing all entities and the correspondingrelations. We evaluate our proposed method on the widely-adopted Visual GenomeDataset, and the results demonstrate the effectiveness and superiority of ourmodel.

Quick Read (beta)

loading the full paper ...