Look, Read and Enrich. Learning from Scientific Figures and their Captions

Abstract

Compared to natural images, understanding scientific figures is particularly hard for machines. However, there is a valuable source of information in scientific literature that until now has remained untapped: the correspondence between a figure and its caption. In this paper we investigate what can be learnt by looking at a large number of figures and reading their captions, and introduce a figure-caption correspondence learning task that makes use of our observations. Training visual and language networks without supervision other than pairs of unconstrained figures and captions is shown to successfully solve this task. We also show that transferring lexical and semantic knowledge from a knowledge graph significantly enriches the resulting features. Finally, we demonstrate the positive impact of such features in other tasks involving scientific text and figures, like multi-modal classification and machine comprehension for question answering, outperforming supervised baselines and ad-hoc approaches.
