Unpaired Image Captioning by Language Pivoting

Abstract

Image captioning is a multimodal task involving computer vision and natural language processing, where the goal is to learn a mapping from an image to its natural language description. In general, the mapping function is learned from a training set of image-caption pairs. However, for some languages, a large-scale paired image-caption corpus might not be available. We present an approach to this unpaired image captioning problem based on language pivoting. Our method can effectively capture the characteristics of an image captioner from the pivot language (Chinese) and align it to the target language (English) using a separate pivot-target (Chinese-English) parallel sentence corpus. We evaluate our method on two image-to-English benchmark datasets: MSCOCO and Flickr30K. Quantitative comparisons against several baseline approaches demonstrate the effectiveness of our method.
