A Corpus for Reasoning About Natural Language Grounded in Photographs

  • 2018-11-01 16:47:44
  • Alane Suhr, Stephanie Zhou, Iris Zhang, Huajun Bai, Yoav Artzi
  • 30


We introduce a new dataset for joint reasoning about language and vision. The data contains 107,296 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a photograph. We present an approach for finding visually complex images and crowdsourcing linguistically diverse captions. Qualitative analysis shows the data requires complex reasoning about quantities, comparisons, and relationships between objects. Evaluation of state-of-the-art visual reasoning methods shows the data is a challenge for current methods.

