Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

Abstract

Interest in Artificial Intelligence (AI) and its applications has seen unprecedented growth in the last few years. This success can be partly attributed to advances in sub-fields of AI such as Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). The largest growth in these fields has been made possible by deep learning, a sub-area of machine learning that uses the principles of artificial neural networks. This has created significant interest in the integration of vision and language, where the tasks are designed such that they readily embrace the ideas of deep learning. In this survey, we focus on ten prominent tasks that integrate language and vision, discussing their problem formulations, methods, existing datasets, and evaluation measures, and comparing the results obtained with the corresponding state-of-the-art methods. Our efforts go beyond earlier surveys, which are either task-specific or concentrate on only one type of visual content, i.e., image or video. Furthermore, we provide some potential future directions in this field of research, in the hope that this survey will bring innovative thoughts and ideas to address the existing challenges and build new applications.
