Abstract
We introduce a new dataset for joint reasoning about language and vision. The data contains 107,296 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a photograph. We present an approach for finding visually complex images and crowdsourcing linguistically diverse captions. Qualitative analysis shows the data requires complex reasoning about quantities, comparisons, and relationships between objects. Evaluation of state-of-the-art visual reasoning methods shows the data is a challenge for current methods.