Auto-Encoding Scene Graphs for Image Captioning

Abstract

We propose the Scene Graph Auto-Encoder (SGAE), which incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions. Intuitively, we humans use the inductive bias to compose collocations and make contextual inferences in discourse. For example, when we see the relation `person on bike', it is natural to replace `on' with `ride' and infer `person riding bike on a road' even if the `road' is not evident. Therefore, exploiting such bias as a language prior is expected to make conventional encoder-decoder models less likely to overfit to the dataset bias and more focused on reasoning. Specifically, we use the scene graph, a directed graph ($\mathcal{G}$) where an object node is connected by adjective nodes and relationship nodes, to represent the complex structural layout of both image ($\mathcal{I}$) and sentence ($\mathcal{S}$). In the textual domain, we use SGAE to learn a dictionary ($\mathcal{D}$) that helps to reconstruct sentences in the $\mathcal{S}\rightarrow \mathcal{G} \rightarrow \mathcal{D} \rightarrow \mathcal{S}$ pipeline, where $\mathcal{D}$ encodes the desired language prior; in the vision-language domain, we use the shared $\mathcal{D}$ to guide the encoder-decoder in the $\mathcal{I}\rightarrow \mathcal{G}\rightarrow \mathcal{D} \rightarrow \mathcal{S}$ pipeline. Thanks to the scene graph representation and shared dictionary, the inductive bias is transferred across domains in principle. We validate the effectiveness of SGAE on the challenging MS-COCO image captioning benchmark, e.g., our SGAE-based single model achieves a new state-of-the-art $127.8$ CIDEr-D on the Karpathy split, and a competitive $125.5$ CIDEr-D (c40) on the official server, even compared to other ensemble models.
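
To make the scene graph representation concrete, the following is a minimal illustrative sketch of a directed graph with object, attribute, and relationship nodes, populated with the `person riding bike on a road' example. It is not the SGAE implementation: the class name `SceneGraph`, its methods, and the attribute `red` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneGraph:
    """Illustrative scene graph (hypothetical structure, not the paper's code)."""
    objects: List[str] = field(default_factory=list)
    # Maps an object index to the attribute (adjective) nodes attached to it.
    attributes: Dict[int, List[str]] = field(default_factory=dict)
    # Directed edges stored as (subject index, relationship, object index).
    relations: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_object(self, name: str, attrs: Tuple[str, ...] = ()) -> int:
        """Add an object node with optional attributes and return its index."""
        self.objects.append(name)
        idx = len(self.objects) - 1
        self.attributes[idx] = list(attrs)
        return idx

    def add_relation(self, subj: int, predicate: str, obj: int) -> None:
        """Add a directed relationship from object `subj` to object `obj`."""
        self.relations.append((subj, predicate, obj))

    def triplets(self) -> List[Tuple[str, str, str]]:
        """Return relationships as (subject, predicate, object) name triplets."""
        return [(self.objects[s], p, self.objects[o]) for s, p, o in self.relations]


# Example graph for `person riding bike on a road'.
graph = SceneGraph()
person = graph.add_object("person")
bike = graph.add_object("bike", attrs=("red",))  # "red" is a made-up attribute
road = graph.add_object("road")
graph.add_relation(person, "ride", bike)
graph.add_relation(bike, "on", road)
```

In this sketch, calling `graph.triplets()` yields the relationship triplets that a downstream encoder could embed; how SGAE actually embeds the graph and re-encodes it through $\mathcal{D}$ is not specified here.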
