Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

  • 2020-09-12 13:26:29
  • Aditya Mogadala, Marimuthu Kalimuthu, Dietrich Klakow

Abstract

The interest in Artificial Intelligence (AI) and its applications has seen unprecedented growth in the last few years. This success can be partly attributed to the advancements made in the sub-fields of AI such as Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). The largest of the growths in these fields has been made possible with deep learning, a sub-area of machine learning, which uses the principles of artificial neural networks. This has created significant interest in the integration of vision and language. The tasks are designed such that they perfectly embrace the ideas of deep learning. In this survey, we focus on ten prominent tasks that integrate language and vision, discussing their problem formulations, methods, existing datasets, and evaluation measures, and comparing the results obtained with corresponding state-of-the-art methods. Our efforts go beyond earlier surveys which are either task-specific or concentrate only on one type of visual content, i.e., image or video. Furthermore, we also provide some potential future directions in this field of research with an anticipation that this survey brings in innovative thoughts and ideas to address the existing challenges and build new applications.

 


Aditya Mogadala
Marimuthu Kalimuthu
Dietrich Klakow
Spoken Language Systems (LSV)
Saarland Informatics Campus
Saarland University
66123 Saarbrücken, Germany

1 Introduction

Recent advancements in deep learning research have led the fields of Computer Vision (CV) and Natural Language Processing (NLP) to see significant progress in several tasks. Independent of NLP, computer vision has achieved prominent improvements in tasks such as visual content classification (?), object detection (?), segmentation (?), etc., using self-supervision (?) or large annotated datasets. Similarly, independent of computer vision, NLP has seen a surge of interest in solving multiple tasks at once with unsupervised pretraining of language models (?, ?, ?) using large unlabeled corpora. However, there is also interest in solving challenges that combine linguistic and visual information from these traditionally independent fields. Methods that address the challenge of integration should provide a complete understanding of visual or textual content, and are expected to (1) generate comprehensible but concise and grammatically well-formed descriptions of the visual content, or vice versa by generating the visual content given a textual description in a natural language, (2) identify objects in the visual content and infer their relationships to reason about or answer arbitrary questions about them, (3) navigate through an environment by leveraging input from both vision and natural language instructions, (4) translate textual content from one language to another while using the visual content for disambiguation, (5) generate stories about the visual content, and so on. Designing methods which can process and relate information from multiple modalities (i.e., linguistic and visual information) is usually considered to be a part of multimodal learning (?). Efficiently solving the above-mentioned and related challenges can result in many potential applications.

For example, visually impaired individuals can be assisted by visual scene understanding, where they can get information about a scene from generated descriptions and by being able to ask questions about it. Other applications include automatic surveillance (?), autonomous driving (?), human-computer interaction (?), city navigation (?), and so on. Also, solving such challenges can provide an excellent test bed for computer vision and NLP systems, one that is much more comprehensive than independent CV and NLP evaluations. Given such a broad scope for fundamental and applied research, there have been several surveys in recent years that provide a comprehensive overview of the integration of vision and language tasks. These surveys have, however, concentrated on covering specific vision and language integration tasks such as image description (?, ?, ?) or video description generation (?), visual question answering (?, ?), action recognition (?), and visual semantics (?). The surveys which went beyond these specific tasks have either summarized dataset statistics (?) or provided a comprehensive overview of only NLP tasks such as natural language generation (NLG) (?) and commonsense reasoning (?). There is also an attempt to cover multiple modalities (including sound) (?), but it is structured in a bottom-up manner, giving more importance to the underlying fusion technologies than to the tasks themselves. Also, there was some interest in understanding the limitations of integration of vision and language research (?). However, it is limited to the tasks of language-grounded image understanding. Furthermore, there were ideas to develop theories on the complementarity of language and visual data in human-machine communication from a theoretical point of view (?). In this survey, we go beyond these and present a comprehensive overview of ten different tasks that are prominent in the current integration of vision and language research.

We begin in Section 2 with a background on the traditional tasks in CV and NLP separately and show how they facilitate the design of the ten prominent tasks for the integration of vision and language. Following that, we provide an in-depth exploration of each of the ten tasks and present more details about the datasets, methods, results, and open challenges in separate sections beginning at Section 3 and ending at Section 9. Further, in Section 10, we introduce details about joint pretraining of vision and language for solving multiple tasks at once. This is followed in Section 11 by potential future research directions. Finally, in Section 12, we conclude our survey and offer some insights.

2 Background

In this section, we first briefly introduce some of the standard tasks observed in computer vision and NLP separately. Following that, we present how these tasks are modified such that they facilitate the design of ten prominent tasks for the integration of vision and language.

2.1 CV Tasks

Computer vision comprises many highly diverse tasks. However, only some of those tasks are commonly used, owing to their strong relevance to downstream applications. Keeping in mind that the underlying goal of computer vision is to describe and explain visual information, we divide the tasks from the perspective of where the visual information arises. In this survey, we mainly focus on image and video as the visual information.

2.1.1 Image as Visual Information

Whenever images are used as the visual information, we need to consider two important points: (1) Knowing the tasks where images are used as input and (2) Representation of an image. In the following, we discuss various computer vision tasks that use images as the input and present the recent progress made in representing images.

Tasks.

There are several tasks in computer vision which use images as input. Although some of them look similar, there is a distinction between them. We list those that are popularly used: (1) Image Classification, (2) Object Localization, (3) Object Detection, (4) Object Segmentation, (5) Object Identification, (6) Instance Segmentation, and (7) Panoptic Segmentation. There are also advanced tasks that use images as visual information and assist in the integration of computer vision and NLP: (1) Image Style Transfer, (2) Image Colorization, (3) Image Reconstruction, and (4) Image Synthesis.

Representation.

The advent of deep learning (?) has tremendously changed the field of computer vision. Images are now best represented by leveraging automatic feature extraction methods. Convolutional Neural Networks (CNNs) (?) have become the de facto standard for generating representations of images using end-to-end trainable models. There are several variations of CNNs that learn image features with supervised or self-supervised techniques (?). Most of these techniques are designed to learn transferable general image features by leveraging the tasks presented earlier. Usually, the most preferred transferable global image representations are learned with deep CNN architectures such as AlexNet (?), VGGNet (?), GoogLeNet (?), Inception-v3 (?), Residual Networks (ResNet) (?), DenseNet (?), and EfficientNet (?) using large datasets, viz. ImageNet (http://www.image-net.org/) (?), MSCOCO (http://cocodataset.org/#home) (?), and Visual Genome (https://visualgenome.org/) (?). However, for some vision and language integration tasks, it is preferred to learn global image features during task-specific training as opposed to using generic, pretrained representations. For learning local features of objects in the images represented with bounding boxes, the preferred choice is to utilize region-specific CNN architectures such as Region-based CNN (R-CNN) (?). More recently, there is an interest in using self-attention based approaches, namely the Transformer (?), for achieving end-to-end object detection (?).

2.1.2 Video as Visual Information

As with images, when a video is used as the visual information we need to consider two crucial things: (1) Knowing the tasks where videos are used as input and (2) Representation of a video. In the following, we discuss different tasks in computer vision that use video as input and further present the recent progress made in video representation.

Tasks.

Most of the tasks in CV are centered on images. However, tasks on videos are also gaining importance, such as (1) Object Tracking, (2) Action Classification, (3) Emotion Detection, (4) Scene Detection, and (5) Automated Editing.

Representation.

Videos extend the three-dimensional representation of images into four dimensions by adding a temporal axis. Usually, the visual data in videos is extracted as individual frames, to which the same techniques used for local and global image representation are applied. In addition, spatio-temporal features are learned either from general video analysis, as in C3D (?), or from action recognition datasets such as Kinetics (?), to build R3D or I3D features (?) using different CNN architectures.

2.2 NLP Tasks

There are various standard tasks in NLP. However, some of them are used more widely than others because of their relevance to downstream applications. Taking into consideration that the underlying intent of NLP tasks is to comprehend or to generate language, we look into some of the popular tasks that are driving NLP research. We also present the approaches used to represent language.

Tasks.

The aim of NLP tasks is to understand or generate language. Some of the traditional tasks that are used to comprehend language are shallow parsing, syntax parsing, semantic role labeling, named entity recognition, entity linking, co-reference resolution, etc. Similarly, the tasks which are designed to generate language in a conditional or unconditional manner are machine translation, summarization, etc.

Representation.

Language is usually represented either with bag-of-words or with sentence representations. Words in a sentence are commonly initialized with pretrained word embeddings (?, ?). Additionally, to represent variable-length text, sequence learning techniques based on recurrent neural networks are applied, such as the unidirectional Long Short-Term Memory (LSTM) (?), the bidirectional LSTM (BiLSTM), the unidirectional Gated Recurrent Unit (GRU) (?), or the bidirectional GRU (BiGRU). Recently, to enable parallelization when training on sequences, Transformers (?) have been used to build architectures such as BERT (?) and its variations.
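
To make the bag-of-words representation mentioned above concrete, the following is a minimal sketch, not a method taken from any of the cited works. It assumes lowercased, whitespace-based tokenization, and the function names `build_vocabulary` and `bag_of_words` as well as the toy captions are hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Map each distinct lowercased token to a column index (order of first appearance)."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def bag_of_words(sentence, vocab):
    """Return a count vector over the vocabulary; tokens outside the vocabulary are ignored."""
    counts = Counter(sentence.lower().split())
    vector = [0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


# Toy captions (hypothetical data, not drawn from any dataset discussed here).
captions = ["a dog chases a ball", "a ball rolls"]
vocab = build_vocabulary(captions)
vectors = [bag_of_words(caption, vocab) for caption in captions]
```

Such a vector discards word order, which is one reason sequence models such as LSTMs or Transformers are preferred when the order of words matters.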

2.3 CV and NLP Integration Tasks

Over the past few years, significant progress has been made in the research concerning the integration of language and vision. Several tasks exist which combine language observed at different levels (such as words, phrases, sentences, paragraphs, and documents) with visual information represented by images or videos. Initially, most works concentrated on combining low-level linguistic units, such as words, with images or videos for building visual-semantic embeddings (?, ?, ?, ?, ?, ?, ?, ?, ?), which are beneficial for downstream applications as well as for understanding adversarial attacks (?) to improve model robustness. However, it is more appealing to look into those tasks that go beyond words and consider variable-length texts larger than words as language input. Most of these tasks are seen as an extension to either CV, NLP, or both. Figure 1 shows the different tasks.

Figure 1: Ten different Language and Vision integration tasks.

To understand how these tasks can be seen as a natural extension of tasks in computer vision, NLP, or both, we briefly relate each of them to similar tasks addressed in the respective fields.

Extension of NLP Tasks
  • Visual Description Generation is closely related to conditional language modeling (?) or Natural Language Generation (NLG) (?) tasks in NLP. Given non-linguistic information (e.g., image or video), the goal is to generate a human-readable text snippet that describes the input.

  • The task of Visual Storytelling solves a similar problem to visual description generation. However, instead of dealing with a single visual input, a sequence of visual inputs is used to generate a narrative summary based on the text aligned with them. It can be seen that the task is closely aligned to text summarization (?, ?), mostly generating abstractive summaries.

  • Visual Question Answering draws its inspiration from the text-based question-answering (?, ?) which is one of the long standing NLP research topics. Here, answering questions about visual information is seen as its natural extension.

  • The task of Visual Dialog aims at creating a meaningful dialog in natural, conversational language about visual content. It is seen as a visual analogue of the text-based dialog and conversation systems (?, ?, ?) that have been explored in NLP over many years.

  • Visual Referring Expression is an extension of referring expressions (?) in natural language generation systems. Also, the sub-problem in visual referring expression (i.e., comprehension) is seen as analogous to pragmatics in linguistics (?) due to its usage of context.

  • Visual Entailment is an inference task for predicting whether the image semantically entails the text. It is a natural extension of natural language inference (?, ?), in which the premise is text; in visual entailment, the premise is visual content instead.

  • Multimodal Machine Translation aims to perform translation from source language(s) to target language(s) by leveraging the visual information along with the text in source language(s). It is influenced by the well-known NLP task of automatically translating textual contents between two languages (?, ?).

Extension of CV Tasks
  • Visual Generation deals with the generation of visual content by conditioning on the text. It can be seen as a multimodal extension of the popular computer vision tasks of image-to-image translation (?) and neural style transfer (?).

  • The task of Visual Reasoning is a direct extension of visual perception where standard computer vision tasks such as object classification (?), detection (?), or segmentation (?) are performed. Instead of providing only class labels (in case of classification), bounding boxes (in case of detection), or segments (in case of segmentation), visual reasoning is expected to provide a relationship between detected objects by generating an entire visual scene graph. Furthermore, the scene graph is leveraged to reason and answer questions about visual information. It can also be used to reason about whether a natural language statement is true regarding a visual input (?).

Extension of both NLP and CV Tasks
  • Vision-and-Language Navigation is a task that can be seen as a transition from standard vision-based navigation using only visual input (?, ?) or from navigation based on natural language instructions (?, ?). The expectation here is that natural language navigation instructions should be interpreted based on visual input. Hence, it combines both vision and language.

Representation.

In earlier sections, we discussed different architectures used to represent both vision and language separately. Combining representations of language and vision is essential to address vision and language integrated tasks. There are various models proposed for each task to build representations integrating vision and language. We discuss more about them in each of the task sections.

2.4 Summary

In this section, we have seen tasks that integrate computer vision and NLP. Also, we explored diverse methods that are used for the representation of vision and language. To train these methods, standard gradient descent optimization algorithms such as Stochastic Gradient Descent (SGD) (?), ADAM (?) or RMSProp (?) are used. Furthermore, some methods also leverage Reinforcement Learning (RL) (?). One can observe that most of the tasks use similar architectures for the representation of vision and language and depend on standard optimization algorithms for training. This shows that, although the aim of the task is different, the underlying principles to extract meaning from the unstructured data remain constant.
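
As an illustration of the gradient-based training mentioned above, the following minimal sketch applies plain stochastic gradient descent to a one-parameter least-squares problem. It is only a toy example under stated assumptions: the model `y = w * x`, the squared-error loss, the learning rate, the number of steps, and the function name `sgd_fit_slope` are all hypothetical choices made for illustration, and the vision and language models discussed in this survey are of course far larger.

```python
import random


def sgd_fit_slope(xs, ys, learning_rate=0.01, steps=500, seed=0):
    """Fit y = w * x by SGD on the per-example squared error (w * x - y) ** 2."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w = 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))        # sample one training example
        error = w * xs[i] - ys[i]         # prediction error on that example
        gradient = 2.0 * error * xs[i]    # derivative of the squared error w.r.t. w
        w -= learning_rate * gradient     # SGD update: step against the gradient
    return w


# Toy data generated from the true slope 2.0 (hypothetical, noise-free).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]
w = sgd_fit_slope(xs, ys)
```

With these settings, every update shrinks the distance between `w` and the true slope, so `w` ends up close to 2.0. Optimizers such as Adam or RMSProp differ mainly in how they scale this update per parameter.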

3 Visual Description Generation and Storytelling

In this section, we explore two different tasks, Visual Description Generation and Visual Storytelling. Although the goals of these tasks do not perfectly line up, they share the common intention of generating a textual description when conditioned on visual input. In the following, we present more details about each of these tasks separately.

3.1 Visual Description Generation

The aim of description generation is to generate either a global or a dense description for a given visual input. However, there are various ways to explore the problem with different types of visual input, i.e., either an image or a video.

3.1.1 Image Description Generation - Introduction

There are many subareas of image description generation where the underlying goal of generating global or dense descriptions remains the same, but the way those descriptions appear is different. In the following section, we explore some of the popular categories observed in image description generation.

Standard Image Description Generation.

The goal of standard image description generation is to generate a sentence-level description given an image. Such methods leverage the vocabulary of the dataset to generate the description that best depicts the scene in the image. Figure 2 summarizes the task.

Figure 2: Given an image, the Standard Image Caption Generation Model generates a single global textual description.

Initially, several methods were developed based on templates, n-grams and dependency parsing (?, ?, ?, ?, ?, ?, ?). Recently, however, image description generation models based on the encoder-decoder framework (?) have become popular and have been extended with the attention mechanism (?) to support the selection of local image features that are useful for the generation of words at each time step. Table 1 summarizes different setups for generating image descriptions using neural network based non-attention, attention, and reinforcement learning (RL) approaches. Other variations include cross-lingual image captioning (?) and multi-language image description generation (?). In the following, we explore some of the related ideas that expand the scope of image description generation.
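
The role of attention in selecting local image features at each decoding time step can be sketched as follows. This is a simplified illustration and not the exact formulation of any cited model: it assumes dot-product scoring between the decoder state and each region feature, whereas published captioning models often use a learned scoring network. The names `softmax` and `attend` and the toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, region_features):
    """Return attention weights over regions and the weighted-sum context vector."""
    # Assumed scoring function: dot product between decoder state and each region feature.
    scores = [sum(h * v for h, v in zip(decoder_state, region))
              for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [sum(w * region[d] for w, region in zip(weights, region_features))
               for d in range(dim)]
    return weights, context


# Three toy region features and one decoder state (hypothetical values).
region_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, region_features)
```

In a full captioning model, the context vector would be fed to the decoder together with the previous word to predict the next word, and the weights would be recomputed at every time step.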

Approach Attention RL
MLBL (?)
m-RNN (?)
Minds Eye (?)
BRNN (?)
NIC (?)
LRCN (?)
Guided LSTM (?)
Deep Bidirectional LSTM (?)
Regional Visual Attributes (?)
Language CNN (?)
ConceptNet-NIC (?)
Visual Attention (?)
Region-based Attention (?)
Attribute Attention (?)
Review Attention (?)
Adaptive Attention (?)
Areas of Attention (?)
Contrastive Adaptive Attention (?)
Neural Baby Talk w/ Attention (?)
Convolutional Attention (?)
Self-Critical Attention (?)
Policy Gradient (?)
Up-Down (?)
Multi-task Captioning (?)
Stack Captioning (?)
Table 1: Summary of methods for generating a global description of an image. Approaches are segregated based on their usage of no-attention, attention, and RL techniques.

Dense Image Description Generation.

Dense image description generation aims to create descriptions at the local object-level, referred to as dense captions. Several approaches (?, ?, ?, ?) exist to generate dense captions. Usually, they use representations of phrases and their relationships to generate descriptions (?).

Image Paragraph Generation.

Image paragraph generation aims to create paragraphs instead of generating a single simple description or dense descriptions for an image. Generated paragraphs are expected to be coherent and contain fine-grained natural language descriptions (?, ?).

Spoken Language Image Description Generation.

Spoken language image description generation expands the description generation task to work with spoken language, instead of limiting it to the written form of language. Approaches such as visually grounded speech signals (?) address the standard image description generation task from the perspective of spoken language.

Stylistic Image Description Generation.

Stylistic image description generation adds styles to the standard image description generation, where the generated descriptions adhere to a specific style. For example, ? (?) generated captions which capture the sentiments from an image, while ? (?) generated humorous and romantic captions. It has also been extended by leveraging unpaired textual corpora (?) to generate story-like captions. Furthermore, to make the generated captions more human-like, personality traits have been used to generate captions (?). Recently, multi-style image description generation (?) has been explored, in which a single model using unpaired data is built to generate different stylized captions.

Unseen Objects Image Description Generation.

Unseen objects image description generation leverages images which lack paired descriptions. Most paired image-description datasets cover only a small number of visual object categories. Hence, methods such as Deep Compositional Captioning (DCC) (?), Novel Object Captioner (NOC) (?), Constrained Beam Search (CBS) (?), and LSTM-C (?) address the challenge of generating descriptions for these images. They generate descriptions for visual object categories that are previously unseen in image-description corpora, either by transferring information between seen and unseen objects before inference (i.e., before test time), or by placing constraints on the generation of description words during inference (i.e., during test time). A few approaches (?, ?) have transferred information both before and during inference. Recently, a pointing LSTM was designed to point to novel objects (?) by balancing the generation and copying of words. Nevertheless, earlier approaches work only with a limited set of objects. To address this issue, the large-scale nocaps dataset (?) was created.

Diverse Image Description Generation.

Diverse image description generation incorporates diversity in the generated captions. A few approaches (?, ?) have leveraged adversarial training, while ? (?) used diverse beam search to decode diverse image captions in English. Approaches have also been proposed to describe cross-domain images (?).

Controllable Image Description Generation.

Controllable image description generation selects specific objects in an image, defined by a control signal, to generate descriptions. Initially, ? (?) generated layouts from images, while ? (?) counted image objects to produce multiple captions for a given image. Additionally, a control signal has been used to make the image captioning more controllable and to generate diverse captions. ? (?) used either a sequence or a set of image regions. Also, chunks of the generated sentences were explicitly grounded on regions. Furthermore, instead of making captions only diverse, there were also attempts (?) to make the generated descriptions accurate.

3.1.2 Image Description Generation - Datasets

There is a wide range of datasets available for the integration of vision and language research. In fact, they are one of the main driving forces behind the recent accelerated advancements that we are witnessing in this field. These datasets, which associate visual information with textual content, differ from each other in many aspects such as size, quality, and the way in which they are collected. In this survey, we summarize the characteristics of these datasets and give an overview. However, we do not provide a deeper analysis of them, as this was done by ? (?). Many datasets were created in the past decade to address the challenge of image description generation. Some of the early large-scale datasets focus on image captions, while the others are only small- or medium-scale. In the following sections, we cover only those datasets that are extensively used in the literature.

SBU Captioned Photo Dataset (SBU1M).

SBU1M (http://vision.cs.stonybrook.edu/~vicente/sbucaptions) (?) is an automatically collected image description dataset that uses query terms to retrieve images and associated text from Flickr (https://www.flickr.com). This web-scale dataset is distributed as a single plain text file containing 1 million URLs of Flickr images and their corresponding captions. Although it is one of the older datasets in image description research, it has rarely been used in recent years. Table 2 provides basic statistics about this dataset.

Total Images   Captions per Image   Total Captions   Object Categories
1,000,000      1                    1,000,000        89
Table 2: Basic statistics of the SBU1M image description dataset.

Flickr8k.

As with SBU1M, images in the Flickr8k (http://hockenmaier.cs.illinois.edu/8k-pictures.html) (?) dataset are also retrieved from Flickr. However, unlike the automated collection of SBU1M, the images in Flickr8k are selected through user queries for specific objects and actions using the Amazon Mechanical Turk (AMT) platform. The images are then captioned by annotators on AMT such that each image has five independently created captions. Table 3 presents the so-called Karpathy split (https://cs.stanford.edu/people/karpathy/deepimagesent) of the dataset.

Split        Images   Captions per Image   Total Captions
Training     6,000    5                    30,000
Validation   1,000    5                    5,000
Test         1,000    5                    5,000
Total        8,000    5                    40,000
Table 3: Splits of the Flickr8k image description dataset.

Flickr30k.

Flickr30k (http://hockenmaier.cs.illinois.edu/Denotation.html) (?) is an extended version of the previously published Flickr8k dataset, containing images collected from Flickr and captions obtained via crowdsourcing using the AMT platform, following the same strategies employed in Flickr8k. Table 4 presents the previously mentioned Karpathy split of the dataset.

Split        Images   Captions per Image   Total Captions
Training     29,000   5                    145,000
Validation   1,014    5                    5,070
Test         1,000    5                    5,000
Total        31,014   5                    155,070
Table 4: Splits of the Flickr30k image description dataset.

Flickr30k-Entities.

Flickr30k-Entities (http://bryanplummer.com/Flickr30kEntities) (?) extends Flickr30k with manually annotated bounding boxes for images and entity mentions in the captions in order to accomplish the task of language grounding in images, viz. phrase localization, while performing captioning. Specifically, there are 275,775 bounding boxes for the images of Flickr30k and 513,644 entity mentions in the 158k captions of Flickr30k. One distinctive feature of this dataset is that it comes with 244k co-reference chains, in which each chain links the mentions of the same entity across the five different captions of a given image. Some statistics and the Karpathy split of this dataset are presented in Table 5.

Split        Num. of Images   Object Categories   Objects per Category   Objects per Image   Captions per Image   Total Captions
Training     29,783           -                   -                      -                   5                    148,915
Validation   1,000            -                   -                      -                   5                    5,000
Test         1,000            -                   -                      -                   5                    5,000
Total        31,783           44,518              6.2                    8.7                 5                    158,915
Table 5: Splits and statistics of the Flickr30k-Entities image description dataset.

MSCOCO.

MSCOCO (?) is a widely used dataset that is considerably larger than the image captioning datasets discussed so far. It contains natural images that are collected from Flickr. The AMT platform is then used to curate and collect descriptions for the images. This dataset does not have an official split, hence the Karpathy split from the above datasets is commonly used in the vision and language research community. The statistics and splits of the dataset can be found in Table 6.

Split        Images    Captions per Image   Total Captions   Object Categories
Training     113,287   5                    566,435          -
Validation   5,000     5                    25,000           -
Test         5,000     5                    25,000           -
Total        123,287   5                    616,435          80
Table 6: Splits of the MSCOCO image description dataset.
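
The split tables follow a simple arithmetic pattern: the total number of captions in a split equals the number of images multiplied by the number of captions per image, and the dataset totals are the sums over the splits. The short sketch below checks this for the Karpathy splits reported in Tables 3, 4, and 6. The numbers are copied from those tables; the dictionary layout and the function name `check_splits` are hypothetical and serve only as an illustration.

```python
# (images, captions per image, total captions) for each split, copied from Tables 3, 4 and 6.
SPLITS = {
    "Flickr8k": {
        "Training": (6000, 5, 30000),
        "Validation": (1000, 5, 5000),
        "Test": (1000, 5, 5000),
    },
    "Flickr30k": {
        "Training": (29000, 5, 145000),
        "Validation": (1014, 5, 5070),
        "Test": (1000, 5, 5000),
    },
    "MSCOCO": {
        "Training": (113287, 5, 566435),
        "Validation": (5000, 5, 25000),
        "Test": (5000, 5, 25000),
    },
}


def check_splits(splits):
    """Verify images * captions_per_image == total_captions and return summed totals."""
    total_images = 0
    total_captions = 0
    for name, (images, per_image, captions) in splits.items():
        if images * per_image != captions:
            raise ValueError(f"inconsistent caption count in split {name!r}")
        total_images += images
        total_captions += captions
    return total_images, total_captions


totals = {dataset: check_splits(splits) for dataset, splits in SPLITS.items()}
```

The resulting totals can be compared against the "Total" rows of the corresponding tables.
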
MSCOCO-Entities.

MSCOCO-Entities (https://github.com/aimagelab/show-control-and-tell) (?) is a recently introduced dataset based on the original MSCOCO (?) dataset, with the goal of addressing the twin challenges of grounding and controllability in generated image captions. Unlike Flickr30k-Entities, the grounding annotations in this dataset are obtained in a semi-automated way. Table 7 presents some statistics about the dataset as well as its split.

Split        Images    Total Captions   Noun Chunks   Noun Chunks per Caption   Unique Classes
Training     113,287   545,202          1,518,667     2.79                      1,330
Validation   5,000     7,818            20,787        2.66                      725
Test         5,000     7,797            20,596        2.64                      730
Table 7: Splits and statistics of the MSCOCO-Entities image description dataset.

STAIR Captions.

STAIR Captions (http://captions.stair.center) (?) is a large-scale Japanese image captioning dataset that provides Japanese language descriptions for the 164,062 images of MSCOCO, while retaining the same dataset splits, viz. the Karpathy split, as with MSCOCO (see Table 6). The captions are annotated manually using crowdsourcing. The original statistics from the authors of the dataset are provided in Table 8.

Total Num. of Images | Captions per Image | Total Num. of Captions | Vocabulary Size | Avg. Number of Chars
164,062 (123,287) | 5 | 820,310 (616,435) | 35,642 (31,938) | 23.79 (23.80)
Table 8: Statistics of the STAIR Captions image description dataset (Japanese). Public part of the dataset is indicated in brackets.
Multi30k-CLID.

The Multi30k-CLID (https://www.statmt.org/wmt16/multimodal-task.html) (?) dataset was designed for the task of Cross-Lingual Image Description (CLID) generation with the ultimate goal of pushing existing vision and language research towards multilingual multimodal language processing. In the first edition of the task in 2016, the Flickr30k-Entities dataset (?) was extended to the German language by crowdsourcing the descriptions independently from their English language counterparts with the help of professional translators. As with the original Flickr30k, each image comes with five descriptions in German. Hence, the English-German pairs are considered as comparable, though not parallel, corpora. The splits of this dataset for English and German languages can be found in Table 9.

Language of the Captions
Split Images English German
Training 29,000 145,000 145,000
Validation 1,014 5,070 5,070
Testing 1,000 5,000 5,000
Table 9: Splits of the Multi30k-CLID (2016) dataset.

In the second version (https://www.statmt.org/wmt17/multimodal-task.html) of the task in 2017, the Flickr30k-Entities dataset was further extended to support French language captions (?). The annotations were again obtained via crowdsourcing following the same principles as with the previous version. Table 10 presents the number of instances in each language and the splits of the dataset.

Language of the Captions
Split Images English French German
Training 29,000 145,000 145,000 145,000
Validation 1,014 5,070 5,070 5,070
Testing 1,000 5,000 5,000 5,000
Table 10: Splits of the Multi30k-CLID (2017) dataset.

Similar to the earlier editions of the task, in the 2018 version (http://www.statmt.org/wmt18/multimodal-task.html), Czech language translations of the captions were added (?). Following the same strategy as the prior versions of this dataset for obtaining annotations, human translators were employed to produce Czech translations for the captions of Flickr30k-Entities. Table 11 presents splits and statistics of all four languages of the dataset.

Language of the Captions
Split Images Czech English French German
Training 29,000 145,000 145,000 145,000 145,000
Validation 1,014 5,070 5,070 5,070 5,070
Testing 1,071 5,355 5,355 5,355 5,355
Table 11: Splits and statistics of the Multi30k-CLID (2018) dataset.
Conceptual Captions (CC).

Conceptual Captions (https://ai.google.com/research/ConceptualCaptions/download) (?) is a recently introduced web-scale dataset containing more than 3.3M images paired with English language captions. The dataset was harvested from the web in an automatic manner, with the captions extracted from the Alt-text of the retrieved HTML webpages. As a consequence, contrary to other curated image captioning datasets in which each image is paired with five captions, the images in CC have only one description, a fact that is evident in Table 12, which also presents the dataset splits.

Split Images Captions
Training 3,318,333 3,318,333
Validation 15,840 15,840
Test 22,530 22,530
Table 12: Splits of the Conceptual Captions dataset.

Although the dataset is large in scale and offers a wide variety of caption styles, its continued availability to future users is an issue, primarily because the dataset is distributed as a CSV file containing URLs of images. Thus, it inherently suffers from the problem of URLs becoming stale (for instance, due to content being removed or requests going unanswered), which puts it at a disadvantage.

Personality Captions (PC).

Personality Captions (https://parl.ai/projects/personality_captions/) (?) is a large-scale image captioning dataset that comes with so-called personality traits, which are useful for controllable and style-based image captioning. Thus, the samples in the PC dataset are provided as triplets (image, personality trait, caption). Basic statistics such as vocabulary size, along with the dataset splits, are provided in Table 13.

Split | Num. of Images | Captions per Image | Num. of Captions | Personality Types | Vocabulary Size | Avg. Tokens per Caption
Training | 186,858 | 1 | 186,858 | 215 | 33,641 | 11.2
Validation | 5,000 | 1 | 5,000 | 215 | 5,460 | 10.9
Test | 10,000 | 5 | 50,000 | 215 | 16,655 | 11.1
Table 13: Splits and statistics of the Personality Captions dataset.

3.1.3 Image Description Generation - Evaluation Measures, Models, and Results

In this section, we describe only the evaluation measures which are used for the task of Image Description Generation, as Models, Results, and some Discussion have been broadly presented in recent surveys (?).

Evaluation Measures.

We divide the evaluation measures into three different categories, where the first set of measures is “Language Metrics”, the second category is “Retrieval Metrics”, and the third category denotes “Human Evaluation”.

“Language Metrics” evaluate the machine-generated text based on reference text using word overlaps and are presented in the following.

  • Bilingual Evaluation Understudy (BLEU) (?) was originally developed for machine translation to compare machine generated output with human Ground Truth (GT). BLEU calculates the overlap between the unigrams (BLEU-1 (B-1)) or, more generally, the n-grams (BLEU-2 (B-2), BLEU-3 (B-3), BLEU-4 (B-4), and so on) of the generated sentence and those of the set of reference sentences. To achieve a high BLEU score, generated descriptions should match the human GT words as well as their order. The maximum achievable BLEU score is 1.0 (or sometimes equivalently 100) for an exact match between the generated and reference sentences.

  • Metric for Evaluation of Translation with Explicit Ordering, popularly known as METEOR (?), overcomes some issues with BLEU, such as the need for exact word matching. METEOR performs semantic matching by leveraging WordNet to match words at various levels, using synonymy and paraphrase matching. The METEOR score is computed using the alignment between the machine generated output and the corresponding reference sentences. Initially, the set of unigrams from the generated and reference sentences is used to perform alignment. If there are multiple options available for alignments between the generated and reference sentence, the alignment with the fewest crossings is preferred. After the alignment is finalized, the METEOR score is calculated from the unigram precision and recall of the aligned words, combined with a penalty for fragmented alignments.

  • Recall Oriented Understudy for Gisting Evaluation (ROUGE) (?) was designed to evaluate textual summaries. As opposed to BLEU, which concentrates on n-gram precision, ROUGE calculates the recall score of the generated sentences corresponding to the reference sentences. The most prominent ROUGE variant used is ROUGE-L, which is based on the longest common subsequence. Other variants include ROUGE-W (Weighted Longest Common Sub-sequence) and ROUGE-S (Skip-Bigram Co-Occurrences Statistics). One advantage of ROUGE-L over BLEU and METEOR is that it checks for subsequences within a sentence. Moreover, specifying the n-gram length (as required in BLEU) is not necessary as it is automatically incorporated.

  • Consensus-based Image Description Evaluation (CIDEr) (?) evaluates the consensus between a generated sentence and a set of reference sentences by performing different language pruning techniques, such as stemming and building a set of n-grams. N-grams that are common among the reference sentences of all visual data are given lower weight, as they are less informative about the visual content, and biased towards the textual content of the sentences. The weight for each n-gram is computed using Term Frequency Inverse Document Frequency (TF-IDF), where TF puts higher weight on frequently occurring n-grams in the reference sentence of the visual content, whereas IDF puts lower weight on commonly appearing n-grams across the whole dataset. To remove the mismatch between human evaluation and CIDEr scores, a variant of CIDEr, CIDEr-D, is used. It adds small variations, such as not stemming and ensuring that the words with high confidence are not repeated in a sentence by introducing a Gaussian penalty over length differences between the generated and reference sentences. As in the case of vanilla CIDEr, it produces high scores even if the sentences do not make sense.

  • Semantic Propositional Image Captioning Evaluation (SPICE) (?) measures the similarity between the scene graph tuples parsed from generated sentences and human created GT sentences. The scene graph encodes objects and their relationships through dependency parsing. Hence, it makes SPICE heavily dependent on parsing, which can be prone to errors. Similar to METEOR, SPICE uses WordNet to find and treat synonyms as positive matches when computing the F1 score between the tuples of generated sentences and the ground truth.
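
To make the word-overlap idea behind the language metrics concrete, the following is a minimal Python sketch of two of them: the clipped n-gram precision and brevity penalty underlying BLEU, and the longest-common-subsequence recall underlying ROUGE-L. It assumes whitespace-tokenized sentences, applies no smoothing, and uses illustrative function names. It is not the official scoring code used in captioning benchmarks, whose tokenization and corpus-level aggregation differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    often as it appears in the single reference where it is most frequent."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: geometric mean of the clipped precisions for
    n = 1..max_n, multiplied by a brevity penalty. Without smoothing, any
    zero precision yields a score of 0."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # Reference length closest to the candidate length (ties: shorter one).
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    if len(candidate) > ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - ref_len / len(candidate))
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """Recall variant of ROUGE-L: LCS length divided by reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0
```

For the candidate "a man is riding a horse" and the reference "a man rides a horse", four of the six candidate unigrams are matched after clipping, and the longest common subsequence ("a man a horse") covers four of the five reference tokens.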

“Retrieval Metrics” evaluate the machine generated text based on standard information retrieval measures (?) and are presented in the following paragraphs.

  • Recall@k (R@k) evaluates the number of relevant ground truth sentences retrieved in the Top-k (e.g., Top-1, Top-5, etc.) candidates. A higher Recall@k indicates better performance.

  • Median Rank (MedRank) finds the median rank value of the retrieved ground truth. A lower MedRank value indicates better performance.

  • Mean Reciprocal Rank (MRR) is a binary measure, where the rank of the highest ranking relevant document for a query is used to calculate the reciprocal rank averaged over all queries. A higher MRR indicates better performance.

  • Mean Rank (Mean) refers to the mean rank achieved in retrieving the relevant sentence. A lower Mean value is better.

  • Normalized Discounted Cumulative Gain (NDCG) is a variant of Discounted Cumulative Gain (DCG) (?). NDCG is a cumulative, multilevel measure of ranking quality that is usually truncated at a particular rank level.
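
The retrieval metrics listed above can be sketched in a few lines of Python. The sketch assumes one ground-truth item per query, represented by its 1-based rank in the retrieved list, and it uses the linear-gain form of DCG with a log2 discount. Both are simplifying assumptions (an exponential-gain variant of DCG is also in use), and the function names are illustrative.

```python
import math
from statistics import mean, median


def recall_at_k(ranks, k):
    """Fraction of queries whose ground truth appears within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def median_rank(ranks):
    """Median rank of the ground truth (lower is better)."""
    return median(ranks)


def mean_rank(ranks):
    """Mean rank of the ground truth (lower is better)."""
    return mean(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries (higher is better)."""
    return mean([1.0 / r for r in ranks])


def ndcg_at_k(relevances, k):
    """NDCG truncated at rank k.

    `relevances` holds the graded relevance of the retrieved items in
    ranked order; the ideal ordering sorts them in decreasing relevance.
    """
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

For example, with ground-truth ranks `[1, 3, 2, 10, 1]` over five queries, two of the five queries place the ground truth at rank 1 and four of the five place it within the top 5.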

“Human Evaluation” employs crowd-workers to evaluate the quality of the generated content and is described in the following paragraph.

  • Human Evaluation: The earlier mentioned metrics provide only quantitative measures for evaluating different tasks. However, because machine-generated textual or visual data often does not correlate highly with the human-provided GT, most of the tasks require human evaluations to judge the quality of the content. Depending on the task, various kinds of instructions are given to human evaluators. Most tasks are interested in the relevance of the output to the input. In addition, evaluators indicate which method they prefer based on the generated outputs.

3.1.4 Video Description Generation - Introduction

Going beyond images, the goal of video captioning is to comprehend the spatio-temporal information in a video for generating either one or multiple textual descriptions. As with image description generation (Section 3.1.1), in the following, we explore some of the popular categories observed in video description generation.

Global Video Description Generation.

Global video description generation approaches (?, ?) initially started by grounding sentences that describe actions in the visual information extracted from videos. It was further expanded into generating global natural language descriptions for videos with various approaches, for example, leveraging latent topics (?), corpora knowledge (?), graphical models (?), and sequence-to-sequence learning (?, ?, ?, ?, ?, ?, ?). Figure 3 depicts the description generation task for a complete video.

Figure 3: Given a video (represented as sequence of frames), the Video Caption Generation Model generates a single global description.

The aforementioned approaches leverage only those training datasets with a limited set of visual objects. However, the recognition and description of entities and activities in real-world videos is more difficult. Nevertheless, generating natural language descriptions for such videos is addressed with a factor graph by combining visual detection with language statistics (?). Additionally, sequence-to-sequence (seq2seq) based approaches have been improved with external corpora (?) and also using attention with various techniques such as soft-attention (?), multimodal fusion (?), temporal attention (?), semantic consistency (?), and residual connections (?). Apart from attention-based methods, novel architectures have also been explored, such as incorporation of semantic attributes learned from videos (?), ensemble-based description generator networks (?), and encoder-decoder-reconstructors which leverage both the forward and backward flows, i.e., video-to-description and description-to-video, for video captioning (?). Multi-faceted attention has also been used to select the most salient visual features or semantic attributes, with which an overall sentence is generated (?). Apart from architecture improvements, different machine learning approaches have also been explored. Video captioning has been tackled in a multi-task learning scenario by sharing knowledge between two related tasks (such as temporal- and context-aware video) combined with an entailment generation task (?). Other approaches have leveraged reinforcement learning, either by providing entailment rewards (?) or to address the description generation for multiple fine-grained actions (?). Further, ? (?) proposed a deep network designed to detect inaccuracies in a sentence, and fix them by replacing the inaccurate word(s) with the help of a Visual Text Correction system. Recently, Zhang et al. (?) introduced an object relational graph (ORG) based encoder, which encapsulates the relations among visual objects to build a richer representation, and a decoder that integrates an external language model to capture abundant linguistic knowledge for efficient video description generation. In the following, we discuss some related ideas which expand the scope of video description generation.

Dense Video Description Generation.

The aim of dense video description generation is to achieve fine-grained video understanding by addressing two sub-problems: (1) localizing events in a video, and (2) generating captions for these localized events (?, ?). Further, extending earlier research, some approaches (?) have explicitly linked the sentence to a corresponding bounding box in one of the frames of a video by annotating each of the noun phrases observed in the sentence. Incorporating background knowledge for video description generation is also another line of research (?). However, the core challenge, namely the automatic evaluation of video captioning, is still unsolved. It is currently being studied from the perspective of direct assessment with the help of human assessors (?).

Movie Description Generation.

Movie description generation sees the task of video description generation from a different perspective, in which movie clips are used as inputs. Initially, aligning books to movies (?, ?) was used to generate story-like explanations. Later, movie descriptions (?) were created directly by transcribing audio descriptions, which concentrate on precisely describing what is shown in the movie.

3.1.5 Video Description Generation - Datasets

Similar to the image description generation task, several datasets have been created to address the task of video description generation. In the following, we cover those datasets that are popular and extensively used. For the sake of brevity, we denote hours by h, minutes by m, and seconds by s.

Microsoft Video Description (MSVD).

MSVD (http://www.cs.utexas.edu/users/ml/clamp/videoDescription) (?) is an open domain dataset collected from YouTube clips and annotated using AMT. The dataset is multilingual and contains human generated descriptions in languages such as German, English, Chinese, etc. On average, there are forty-one single sentence descriptions per clip. More statistics about the dataset are presented in Table 14, whereas Table 15 presents its split.

Total Videos | Total Classes | Total Length | Avg. Length | Total Clips | Total Sentences | Total Words | Vocabulary Size
1,970 | 218 | 5.3 h | 10 s | 1,970 | 70,028 | 607,339 | 13,010
Table 14: Statistics of the MSVD dataset.
Split Frames Videos
Training 33,682 1,200
Validation 3,275 100
Test 20,528 670
Total 57,485 1,970
Table 15: Splits of the MSVD dataset.
MPII Cooking Activities.

The MPII Cooking (https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/human-activity-recognition/mpii-cooking-activities-dataset) (?) dataset consists of 65 different cooking activities such as “wash hands”, “put in bowl”, etc., performed while participants prepare one of 14 dishes such as fruit salad, casserole, etc. The dish preparation time ranges between 3 and 41 minutes. The videos are recorded in high resolution (1624x1224), after which the activity annotations are manually created by 6 people. Table 16 presents more statistics about the dataset, whereas its splits can be found in Table 17.

Num. of Subjects | Total Clips | Total Videos | Total Frames | Video Length | Total Length | Num. of Activities | Total Dishes | Activity Annotations
12 | 5,609 | 44 | 881,755 | 3 to 41 m | 8.0 h | 65 | 14 | 5,609
Table 16: Statistics of the MPII Cooking Activities dataset.
Split Frames Subjects
Training 1,071 10
Validation - -
Test 1,277 7
Table 17: Splits of the MPII Cooking dataset.
YouCook.

YouCook (http://web.eecs.umich.edu/~jjcorso/r/youcook) (?) is a more complex real-world cooking dataset than MPII Cooking; the complexity arises from dynamic scene and camera changes. The videos are all downloaded from YouTube and are broadly categorized into 6 different cooking styles, viz. baking, grilling, etc. Video descriptions are obtained via crowdsourcing using AMT. On average, eight descriptions are collected per video. Frames are annotated with objects belonging to categories (such as bowls, utensils, etc.) and actions. More details and splits of the dataset can be found in Table 18 and Table 19, respectively.

Cooking Styles | Object Classes | Total Videos | Total Length | Num. of Sentences | Num. of Words | Vocabulary Size
6 | 10 | 88 | 2.3 h | 2,688 | 42,457 | 2,711
Table 18: Statistics of the YouCook dataset.
Split Videos
Training 49
Validation -
Test 39
Table 19: Splits of the YouCook dataset.
YouCook II.

Similar to the YouCook dataset, YouCook II (http://youcook2.eecs.umich.edu) (?) also consists of instructional cooking videos that are all collected from YouTube. The videos include 89 cooking recipes from four regions: South Asia, East Asia, Europe/Middle East, and America. One unique aspect of this dataset when compared to previously discussed video description datasets is that the videos are annotated with procedure segments that contain rich semantic information. Table 20 presents the statistics about the dataset.

Cooking Recipes | Total Videos | Total Video Length | Avg. Video Length | Procedure Seg. per Video | Total Clips | Num. of Sentences | Vocab. Size
89 | 2,000 | 175.6 h | 316 s | 3-16 | 15,400 | 15,400 | 2,600
Table 20: Statistics of the YouCook II dataset

For each recipe, the videos are randomly split into training, validation, and testing in ratios of 67%, 23%, and 10% respectively. The actual numbers are presented in Table 21.

Split Videos
Training 1,340
Validation 460
Test 200
Table 21: Splits of the YouCook II dataset.
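
A per-recipe random split of this kind can be sketched as follows. This is only an illustration of the idea under stated assumptions: the mapping from video identifiers to recipe names, the fixed seed, and the rounding of the per-recipe counts are hypothetical choices, not details taken from the dataset release.

```python
import random
from collections import defaultdict


def split_per_recipe(video_to_recipe, ratios=(0.67, 0.23, 0.10), seed=0):
    """Randomly split videos into training/validation/test within each recipe.

    `video_to_recipe` maps a video id to its recipe name. The first two
    ratios set the training and validation shares of each recipe; the
    remaining videos of that recipe go to the test set.
    """
    rng = random.Random(seed)
    by_recipe = defaultdict(list)
    for video_id, recipe in video_to_recipe.items():
        by_recipe[recipe].append(video_id)

    splits = {"training": [], "validation": [], "test": []}
    for recipe in sorted(by_recipe):
        videos = sorted(by_recipe[recipe])  # sort first for reproducibility
        rng.shuffle(videos)
        n_train = round(ratios[0] * len(videos))
        n_val = round(ratios[1] * len(videos))
        splits["training"] += videos[:n_train]
        splits["validation"] += videos[n_train:n_train + n_val]
        splits["test"] += videos[n_train + n_val:]
    return splits
```

Splitting within each recipe, rather than over the pooled videos, keeps every recipe represented in all three sets in roughly the same proportions.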
Textually Annotated Cooking Scenes (TACoS).

The TACoS (http://www.coli.uni-saarland.de/projects/smile/page.php?id=tacos) (?) dataset is an extended version of a subset of MPII Composites (?) which contains cooking videos that are each annotated with multiple textual descriptions. It contains only those videos that include activities such as manipulation of cooking ingredients. Around 26 cooking activities are collected with 127 videos. More statistics on the dataset are presented in Table 22 and Table 23. For building and evaluating models, the dataset is split into 50% for training, 25% for validation, and 25% for testing.

Total Videos | Total Clips | Descriptions per Video | Annotation Assignments | Annotations after filtering | Cooking Tasks/Dishes | Action Descriptions
127 | 7,206 | 20 | 2,540 | 2,206 | 26 | 17,334 (tokens)
Table 22: The TACoS dataset statistics - I
Sentence Types | Total Words | Content Words (viz. nouns, verbs, adjectives) | Num. of Verbs (tokens) | Num. of Verbs (lemmas)
11,796 | 146,771 | 75,210 | 28,292 | 435
Table 23: The TACoS dataset statistics - II
TACoS-MultiLevel.

The TACoS dataset was extended into TACoS-MultiLevel (https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus) (?) by collecting three levels of descriptions constituting (i) 15 detailed descriptions per video, (ii) 3-5 short descriptions, and (iii) a single sentence description, using AMT. Overall, the dataset comes with 2,600 triplets of descriptions. Further statistics on the dataset can be found in Table 24.

Total Videos | Total Clips | Total Video Length | Avg. Length | Number of Sentences | Total Words
185 | 14,105 | 27.1 h | 360 s | 52,593 | 2,000
Table 24: Statistics of the TACoS-MultiLevel dataset.
MPII Movie Description (MPII-MD).

The MPII-MD (https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/mpii-movie-description-dataset) (?) dataset contains clips extracted from Hollywood movies and their transcribed audio descriptions. In addition, each clip is paired with a single sentence that is extracted from the script of the movie. Furthermore, transcribed audio is associated with spoken sentences by using timestamps. Misalignment between the audio and visual content is handled by leveraging manual annotation. Table 25 presents the statistics of the dataset.

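Source | Unique Movies | Words (before alignment) | Words (after alignment) | Sentences (after alignment) | Clips (after alignment) | Avg. Length (after alignment) | Total Length (after alignment)
Audio Desc. | 55 | 346,557 | 332,846 | 37,272 | 37,266 | 4.1 s | 42.5 h
Movie script | 50 | 398,072 | 320,621 | 31,103 | 31,071 | 3.6 s | 31.1 h
Total | 94 | 744,629 | 653,467 | 68,375 | 68,337 | 3.9 s | 73.6 h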
Table 25: Statistics of the MPII-MD dataset.

For the task of video description, the MPII-MD dataset is split as follows: 11 movies with associated scripts and audio descriptions (in total 22 alignments, 2 per movie) are used as validation (8) and test sets (14). The remaining 83 movies are used for training purposes.

Montreal Video Annotation Dataset (M-VAD).

M-VAD (https://mila.quebec/en/publications-archive/public-datasets/m-vad/) (?) is a large Descriptive Video Service (DVS)-derived video dataset that is created using 92 movies covering a wide variety of genres. It is collected in a semi-automatic manner with minimal human intervention. The words in the descriptions are annotated with Part-Of-Speech (POS) tags using the Stanford POS tagger. Around 500 proper names are removed from the corpus, since learning proper names is not of interest for a video description model.

Type Movies Words Paragraphs Sentences Avg. Length Total
Un-filtered 92 531,778 52,683 59,415 6.3 s 91 h
Filtered 92 510,933 48,986 55,904 6.2 s 84.6 h
Table 26: Statistics of the M-VAD dataset.

Table 26 presents some statistics about the dataset, while Table 27 presents the official dataset split that balances the genre within each split.

Split Video Clips
Training 38,949
Validation 4,888
Test 5,149
Table 27: Splits of the M-VAD dataset.
MSR Video to Text (MSR-VTT).

MSR-VTT (http://ms-multimedia-challenge.com/2017/dataset) (?), also known as MSR-VTT-10k, is a large-scale video dataset containing automatically crawled videos belonging to 20 categories for the task of video description generation. The sentence annotations are obtained via crowdsourcing using AMT. In addition to the video content, the dataset also contains audio information. Table 28 presents more statistics about the dataset.

Categories Videos Clips Sentences per Clip Sentences Words Vocab. Duration
20 7,180 10,000 20 200,000 1,856,523 29,316 41.2 h
Table 28: Statistics of the MSR-VTT dataset.

Out of 7.2k videos, 30k video clips have been created. However, only a random subset of 10k clips has been released. The dataset is split in the ratio of 65%:5%:30% for training, validation, and testing, respectively. Specific numbers are presented in Table 29.

Split Video Clips
Training 6,513
Validation 497
Test 2,990
Table 29: Splits of the MSR-VTT dataset.
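
The percentages implied by the clip counts in Table 29 can be checked with a few lines of arithmetic. The helper name `split_percentages` is illustrative; the counts are copied from the table.

```python
def split_percentages(counts):
    """Convert absolute split sizes into percentages of the total."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


# Clip counts taken from Table 29.
msr_vtt = {"training": 6513, "validation": 497, "test": 2990}
print(split_percentages(msr_vtt))
# {'training': 65.1, 'validation': 5.0, 'test': 29.9}
```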
Videos Titles in the Wild (VTW).

VTW (http://aliensunmin.github.io/project/video-language/index.html#VTW) (?) is a large-scale dataset of automatically crawled user-generated YouTube videos paired with titles and descriptions. The video clips are on average 90 seconds in duration and are described with one sentence per clip to enable video title generation. It also comes with augmented sentences that contain information that may not be present in the video clip. More statistics of the dataset can be found in Table 30.

Dataset Sentences Vocab. Sentences/Word Nouns Verbs Adjective Adverb
VTW-title 18,100 8,874 2.0 5,850 2,187 1,187 224
VTW-full 44,603 23,059 1.9 13,606 6,223 3,967 846
Table 30: Statistics of the VTW dataset.

Similar to M-VAD, the dataset is randomly split into roughly 80% for training and roughly 10% each for validation and testing. Specific numbers are presented in Table 31.

Split Videos Sentences/Titles
Training 14,100 14,100
Validation 2,000 2,000
Test 2,000 2,000
Table 31: Splits of the VTW dataset.
ActivityNet Captions (ANetCap).

ANetCap (http://activity-net.org/challenges/2017/captioning.html) (?) is a large-scale video dataset (https://cs.stanford.edu/people/ranjaykrishna/densevid) that extends a subset of videos from ActivityNet with dense descriptions. There are multiple descriptions for every video, and the videos contain multiple events occurring at the same time. Another notable aspect of this dataset is that the descriptions focus more on the actions happening in videos. As a result, this dataset is more action-centric than object-centric.

Videos Total Video Hours Avg. Video Length Sentences Avg. Sentence Length
20,000 849 180 s 100,000 13.48 (words)
Table 32: Statistics of the ANetCap dataset.

Table 32 presents more statistics on the dataset, while Table 33 presents its split.

Split Videos
Training 10,024
Validation 4,926
Test 5,044
Table 33: Splits of the ANetCap dataset.
ActivityNet Entities (ANetEntities).

The ANetEntities (https://github.com/facebookresearch/ActivityNet-Entities) (?) dataset augments ANetCap (?) with manually annotated bounding boxes, and was created for the task of grounding language in videos while generating descriptions. It adds around 158k bounding box annotations on ANetCap, each grounded to a Noun Phrase (NP) in the sentence description. More statistics and the dataset splits can be found in Table 34.

Split Videos Sentences Objects Bounding Boxes
Training 10,000 35,000 432 105,000
Validation 2,500 8,600 427 26,500
Test 2,500 8,500 421 26,100
Total 15,000 52,100 432 157,600
Table 34: Statistics and splits of the ANetEntities dataset.
COmprehensive INstructional video analysis (COIN).

COIN (https://coin-dataset.github.io) (?) is a large-scale dataset of instructional YouTube videos from 12 domains such as vehicles, gadgets, sports, etc., that are common in our daily lives. It is aimed at overcoming two limitations of current instructional video datasets, namely diversity and scale. It covers 180 tasks in about 12k videos.

Num. of Domains | Num. of Tasks | Total Videos | Total Segments | Total Duration | Avg. Video Length | Avg. Segment Length
12 | 180 | 11,827 | 46,354 | 476 h, 38 m | 2.36 m | 14.91 s
Table 35: Statistics of the COIN dataset

One unique aspect of this dataset is that it introduces a three-level hierarchy, viz. domain, task, and step, for organizing videos. Table 35 shows some statistics of the dataset, whereas Table 36 presents the training and test splits of COIN.

Split Videos
Training 9,030
Validation -
Test 2,797
Table 36: Splits of the COIN dataset.
HowTo100M.

HowTo100M (https://www.di.ens.fr/willow/research/howto100m) (?) is a large-scale dataset of narrated videos with an emphasis on instructional YouTube videos in which the video creators teach complex tasks with the explicit intention of explaining the visual content on screen. The dataset includes a wide variety of 23k activities from domains such as gardening, personal care, fitness, hand crafting, cooking, etc., and is three orders of magnitude larger than the previously discussed video description datasets. Table 37 presents more statistics about the dataset.

Num. of Domains | Num. of Tasks | Total Videos | Total Clips | Total Duration | Total Captions | Avg. Video Length | Avg. Clip-Caption Pairs per Video
12 | 23,611 | 1.221M | 136M | 134,472 h | 136M | 6.5 m | 110
Table 37: Statistics of the HowTo100M dataset

This dataset has not yet been used for the task of video description generation. Hence, an official dataset split is not available for evaluation purposes.

3.1.6 Video Description Generation - Evaluation Measures, Models, and Results

In this section, we describe only the evaluation measures which are used for the task of Video Description Generation, as Models, Results, and some Discussion have been broadly presented in recent surveys (?).

Evaluation Measures.

The measures used for Video Description Generation are the same as the Language Metrics and Retrieval Metrics used in Image Description Generation, which are presented in Section 3.1.3.

3.2 Visual Storytelling

The task of visual storytelling aims to encode a sequence of images or frames (in the video) to generate a paragraph which is story-like. This is usually considered more beneficial than generating a paragraph from a single image or video.

3.2.1 Image Storytelling - Introduction

The aim of image storytelling is to generate stories from a sequence of images. Although a sequence of images can be perceived as a video, consecutive images in the stream can have sharp changes of visual content, which can cause an abrupt discontinuity between consecutive sentences (?). Hence, it is seen as a sequential vision-to-language task (?) where images are not considered in isolation. Figure 4 summarizes image storytelling, where a story is generated for a sequence of images.

Figure 4: Given a sequence of images, the Image Storytelling Model generates a textual story in sequence.

Initially, semantic coherence in a photo stream is captured by reducing the visual variance. Further, the semantic space is acquired by jointly embedding each photo with its corresponding contextual sentence such that their correlations are discovered (?). It was then improved by exploiting hierarchical architecture (?) and further optimized by incorporating reinforcement learning with rewards (?) for generating relevant and expressive narrative paragraphs. Instead of flat deep reinforcement learning, a hierarchically structured reinforced training has also been studied (?) and has been shown to achieve significantly better performance than with a flat structure. Similarly,  ? (?) used adversarial reward learning to learn an implicit reward function from human demonstrations to optimize policy search with the learned reward function. Nevertheless, the standard form of narration suffers from repetitiveness, with the same objects or events serving to undermine a good story structure. Hence, inter-sentence diversity was explored with diverse beam search to generate more expressive stories (?). The task has also been approached from a different perspective, in which, given a jumbled set of aligned image-description pairs that belong to a story, the task is to sort them such that the output sequence forms a coherent story (?). While earlier research addresses only natural images, some approaches (?) also incorporated medical domain knowledge to generate realistic and accurate descriptions for medical images.

3.2.2 Image Storytelling - Datasets

There are not many datasets created to address the creative task of image storytelling. In the following, we cover all datasets that have been used to advance this interesting and challenging problem.

New York City Storytelling (NYC-Storytelling).

The NYC-Storytelling dataset (?) (https://github.com/cesc-park/CRCN) was created from blogs in which users post their travelogues. The dataset was collected in a semi-automatic manner: automatic crawling, followed by manual selection of travelogues, and finally preprocessing using the NLTK library (https://www.nltk.org). For evaluation purposes, the dataset is split in a ratio of 8:1:1 for training, validation, and testing respectively. Table 38 presents the minimal statistics of the dataset.

Images Blog posts
78,467 11,863
Table 38: Statistics of the NYC-Storytelling dataset.
Disneyland Storytelling.

Similar to NYC-Storytelling, the Disneyland Storytelling dataset is also based on blogs documenting travelogues but specifically about Disneyland Park. This dataset was originally created by (?) but has been reused for visual storytelling tasks. The same ratio of data splits as with the NYC-Storytelling dataset is used for evaluation purposes. The minimal statistics of the dataset can be found in Table 39.

Images Blog posts
60,545 7,717
Table 39: Statistics of the Disneyland-Storytelling dataset.
Sequential Image Narrative Dataset (SIND).

SIND (?) is the first large-scale dataset created for the task of image storytelling. Natural language descriptions of the dataset are divided into three types: (i) Descriptions of Images-in-Isolation (DII), (ii) Descriptions of Images-in-Sequence (DIS), and (iii) Stories for Images-in-Sequence (SIS). The stories are collected via crowdsourcing using AMT. Similar to other image storytelling datasets, this dataset is split into 80%, 10%, and 10% for training, validation, and testing purposes respectively. Table 40 presents the statistics of the dataset.

Type | Images | Flickr Albums | (Text, Image) | Vocab
DII | - | - | 151,800 | 13,800
DIS | - | - | 151,800 | 5,000
SIS | - | - | 252,900 | 18,200
Total | 210,819 | 10,117 | - | -
Table 40: Statistics of the SIND dataset.
Visual Storytelling Dataset (VIST).

VIST (http://visionandlanguage.net/VIST) is the second version (v.2) of SIND (see Section 3.2.2) and is aimed at modeling the social language of humans so that AI evolves toward more human-like understanding. Basic statistics of the dataset are shown in Table 41, while its splits can be found in Table 42.

Images Text Sequences
81,743 10,117
Table 41: Statistics of the VIST (SIND v.2) dataset.
Split Stories Sentences
Training 40,155 200,775
Validation 4,990 24,950
Test 5,055 25,275
Table 42: Splits of the VIST dataset.

3.2.3 Image Storytelling - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different Image Storytelling models and the results obtained by them.

Evaluation Measures.

To evaluate Image Storytelling models, the Language metrics and Retrieval metrics presented in Section 3.1.3 are used.
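
As an illustration of the retrieval metrics, the following minimal Python sketch computes recall at K and the median rank from the rank at which each ground-truth item is retrieved. It assumes 1-based ranks and recall expressed as a percentage; the function names and the example ranks are hypothetical and are not taken from any of the surveyed implementations.

```python
from statistics import median


def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth item is ranked within the top k.

    `ranks` holds the 1-based rank of the ground-truth item for each query.
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    hits = sum(1 for r in ranks if r <= k)
    return 100.0 * hits / len(ranks)


def median_rank(ranks):
    """Median of the ground-truth ranks (lower is better)."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return median(ranks)


# Hypothetical ranks of the ground-truth story for six queries.
ranks = [1, 3, 7, 2, 15, 1]

print(recall_at_k(ranks, 1))  # 2 of 6 queries have rank 1
print(recall_at_k(ranks, 5))  # 4 of 6 queries have rank <= 5
print(median_rank(ranks))     # 2.5 (mean of the two middle ranks, 2 and 3)
```

With an even number of queries the median falls between two ranks, which is one way that half-valued median ranks can arise.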

Models.

Many models have been created to handle the task of Image Storytelling. In Table 43, we present some exemplar architectures (refer to Combined column) created to address the task by integrating both image and language inputs. We also include a column that showcases the optimization techniques used to train those models.

Approach Image Language Combined Optimizer RL
 (?) AlexNet LM MLBL -
 (?) VGG RNN NeuralTalk RMSprop
 (?) GoogLeNet LSTM NIC SGD
 (?) VGG RNN CRCN RMSprop
 (?) VGG GRU Story-Flat -
 (?) VGG LSTM HierarchicalRNN ADAM
 (?) VGG LSTM BARNN -
 (?) VGG LSTM GAN ADAM
 (?) ResNet-152 GRU AREL ADAM
Table 43: Exemplar Image Storytelling architectures.
Results.

In Table 44, Table 45, Table 46, and Table 47 we present the results obtained with a subset of models which use the datasets presented earlier in Section 3.2.2.

Model B-4 CIDEr METEOR R@1 R@5 MedRank
MLBL (?) 0.01 2.6 5.29 1.19 4.52 100.5
NeuralTalk (?) 0.00 0.5 1.34 0.48 2.86 120.5
NIC (?) 0.10 9.1 5.73 0.95 7.38 88.5
CRCN (?) 2.08 30.9 7.69 11.67 31.19 14.00
Story-Flat (?) - - 7.37 - - -
HierarchicalRNN (?) - - 6.07 - - -
BARNN (?) - 41.6 - 29.37 45.43 8
AREL (?) - - 8.39 - - -
Table 44: Results obtained with different models on the NYC-Storytelling dataset.
Model B-4 CIDEr METEOR R@1 R@5 MedRank
MLBL (?) 0.01 3.4 4.99 1.02 4.08 62
NeuralTalk (?) 0.00 0.4 1.34 1.02 3.40 88
NIC (?) 0.07 10.0 4.51 2.83 10.38 61.5
CRCN (?) 3.49 52.7 8.78 14.29 31.29 16
Story-Flat (?) - - 7.61 - - -
HierarchicalRNN (?) - - 7.72 - - -
BARNN (?) - 54.1 - 35.01 49.07 6
AREL (?) - - 9.90 - - -
Table 45: Results obtained with different models on the Disneyland-Storytelling dataset.
Model B-4 CIDEr METEOR R@1 R@5 MedRank
CRCN (?) - - - 9.87 28.74 21
Story-Flat (?) 3.50 6.84 10.25 - - -
HierarchicalRNN (?) 3.7 6.51 9.97 - - -
AREL (?) 5.16 11.35 12.32 - - -
Table 46: Results obtained with different models on the SIND dataset.
Model B-4 CIDEr METEOR R@1 R@5 MedRank
enc-attn-dec (?) - 4.96 32.98 - - -
h-attn-rank (?) - 7.38 33.94 - - -
BARNN (?) - - 33.32 24.07 44.29 9
AREL-t-100 (?) 14.1 9.4 35.0 - - -
Table 47: Results obtained with different models on the VIST dataset.

3.2.4 Image Storytelling - Discussion

We observe that for Image Storytelling, the adversarial approach, i.e., Adversarial REward Learning (AREL) proposed by ? (?), obtains the best results on the language metrics reported for it (e.g., METEOR) across the different datasets, whereas retrieval scores are reported only for other models such as CRCN and BARNN. This attests to AREL’s ability to clone expert behaviors while still generating more human-like stories.

3.2.5 Video Storytelling - Introduction

In comparison to image storytelling, which only deals with a small sequence of images, the aim of video storytelling is to generate coherent and succinct stories for long videos. However, video storytelling is less explored. The video storytelling task was pioneered by ? (?) to address challenges such as diversity in the story and the inherent complexity of video. They introduced residual Bidirectional RNNs (BiRNN) for leveraging context and a narrator model with reinforcement learning. Further, ? (?) created a multi-sentence video description dataset (VideoStory) to resemble stories from social media videos. The goal of social media-specific video description generation was to offer support to people with visual disabilities or other technical issues such as internet bandwidth limitations. Figure 5 summarizes the task of video storytelling where a story in a sequence is generated based on a video as the input.

Figure 5: Given video frames (adapted from (?)) as input, the Video Storytelling Model generates a textual story in sequence.

It is worth noting that this task bears close resemblance to the well-researched area of video summarization using only videos (?).

3.2.6 Video Storytelling - Datasets

As with image storytelling, only a few datasets exist for video storytelling; currently, two are available. In the following, we elaborate on these two datasets.

VideoStory.

VideoStory (?) is a multi-sentence description dataset created from social media videos that are selected to be highly diverse and engaging. Table 48 presents more statistics on the dataset.

Total Videos | Total Length | Total Clips | Avg. Video Duration | Total Sentences | Sentences per Video
20,000 | 396 h | 123,000 | 70 s | 123,000 | 4.67
Table 48: Statistics of the VideoStory dataset.

Models can be evaluated locally on the earmarked test set, whereas the blind test set, Test (Blind), is reserved for online evaluation. However, the dataset, including its annotations, has not been made public yet. Table 49 presents the actual numbers of videos, clips, and annotations for each of the splits.

Split Videos Clips Paragraphs/video Paragraphs Words/paragraph
Training 17,098 80,598 1 17,098 61.76
Validation 999 13,796 3 2,997 59.88
Testing 1,011 14,093 3 3,033 59.77
Test (Blind) 1,039 14,139 3 3,117 69.45
Total 20,147 122,626 - 26,245 62.23
Table 49: Splits of the VideoStory dataset.
VideoStory-NUS.

The VideoStory-NUS dataset (?) (https://zenodo.org/record/2383739) contains social event videos that were collected from YouTube by querying for common and complex events, namely Birthday, Camping, Christmas, and Wedding. Specifically, it comes with 105 manually chosen videos with sufficient inter-event and intra-event variations, which are annotated with descriptive stories obtained through AMT. Each video is annotated by at least 5 different AMT workers, resulting in 529 stories in total. More statistics of the dataset can be found in Table 50.

Domain | Videos | Avg. Video Length | Avg. Story Length | Avg. Sentence Length | Vocab. Size
Open | 105 | 12 m 35 s | 162.6 | 12.1 | 4,045
Table 50: Statistics of the VideoStory-NUS dataset.

For experimental purposes, the dataset is randomly split in a ratio of 14:3:3 for training, validation, and testing respectively. Actual numbers are presented in Table 51.

Split Percentage (%) Videos
Training 70 73
Validation 15 16
Test 15 16
Table 51: Splits of the VideoStory-NUS dataset.
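
To make the split arithmetic concrete, the sketch below shuffles a list of video identifiers and partitions it according to a 14:3:3 ratio. The rounding convention (floor the training share, then divide the remainder between validation and test) is an assumption chosen for illustration; it happens to reproduce the 73/16/16 counts of Table 51, but the procedure actually used by the dataset authors is not stated here. The function name and seed are hypothetical.

```python
import random


def split_by_ratio(items, ratios=(14, 3, 3), seed=0):
    """Randomly split `items` into train/validation/test by integer ratios.

    Assumed rounding: the training size is floored, and the remaining items
    are divided between validation and test in proportion to their ratios.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility

    n = len(items)
    n_train = n * ratios[0] // sum(ratios)
    rest = n - n_train
    n_val = rest * ratios[1] // (ratios[1] + ratios[2])

    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# 105 hypothetical video identifiers.
train, val, test = split_by_ratio(range(105))
print(len(train), len(val), len(test))  # 73 16 16
```

Other rounding conventions would give slightly different counts (for example 73/15/17), so the exact sizes should always be taken from the published splits.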

3.2.7 Video Storytelling - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different Video Storytelling models and the results obtained by them.

Evaluation Measures.

To evaluate Video Storytelling models, the Language metrics and Retrieval metrics presented in Section 3.1.6 are used.

Models.

There are a number of different models available for the task of Video Storytelling. These models combine representations of video and language in an efficient manner to address the task. In Table 52, we present some exemplar architectures (refer to Combined column) created to accomplish the task by integrating both video and language inputs. To understand the optimization techniques used, we also include a column that showcases the optimization method used to train the models.

Approach Video Frame Language Combined Optimizer RL
 (?) C3D VGG GRU H-RNN RMSProp
 (?) R3D ResNet-101 GRU seq-seq+context ADAM
 (?) - ResNet-101 GRU ResBRNN ADAM
Table 52: Exemplar Video Storytelling architectures.
Results.

The Video Storytelling results showcase the efficacy of the proposed models. In Table 53 and Table 54, we present results obtained with a subset of models built using the datasets presented earlier in Section 3.2.6.

Model B-4 CIDEr METEOR R@1 R@5 MedRank
seq-seq+context (?) 1.20 9.37 33.88 - - -
Table 53: Results obtained with different models on the VideoStory dataset.
Model B-4 CIDEr METEOR R@1 R@5 MedRank
mRNN (?) 11.8 81.3 18.0 5.34 21.23 29
Deep Video-Text (?) 11.5 79.5 17.7 4.72 19.85 31
H-RNN (?) 16.1 64.6 15.5 - - -
ResBRNN (?) 14.7 94.3 19.6 7.44 25.77 22
ResBRNN-kNN (?) 15.6 103.6 20.1 - - -
Table 54: Results obtained with different models on the VideoStory-NUS dataset.

3.2.8 Video Storytelling - Discussion

For Video Storytelling, different sets of methods are evaluated on the two datasets. In Table 53, we observe that only one method, which uses the sequence-to-sequence paradigm with contextual information (i.e., seq-seq+context), is evaluated on the “VideoStory” dataset. In contrast, Table 54 compares several methods on the “VideoStory-NUS” dataset. It shows that the approach proposed by ? (?) using Residual BRNN with k-Nearest Neighbours (i.e., ResBRNN-kNN) outperforms most of the baseline methods.

4 Visual Referring Expression

In this section, we explore the task of Visual Referring Expression. The objective of the task is to ground a natural language expression (e.g. a noun phrase or a longer piece of text) to objects in a visual input.

4.1 Image Referring Expression

In the following, we present more details about Visual Referring Expression by using an image as the visual input.

4.1.1 Image Referring Expression - Introduction

In a natural environment, people use referring expressions to unambiguously identify, indicate, or point to particular objects. This is usually done with a simple phrase or within a larger context (e.g. a sentence). Having a larger context provides better scope for avoiding ambiguity and allows the referential expression to easily map to the target object. However, there can also be other possibilities in which people are asked to describe a target object based on its surrounding objects. This provides us with two different possibilities for the visual referring expression task. In the first scenario, referring expressions deal with generation, in which an algorithm generates a referring expression for a given target object that is present in a visual scene. In the second scenario, the referring expression is used to perform comprehension, in which an algorithm locates in an image the object described by a given referring expression. Figure 6 shows an example for the task of referring expression comprehension.

Figure 6: Given an image and a referring expression, the Image Referring Expression Comprehension model identifies the referred object in the image using bounding boxes.

Given these tasks, different approaches have been proposed for referring expression generation (?, ?), comprehension (?), and both combined (?, ?). Note that there is a difference between referring expression tasks and grounding of free-form textual phrases (?) in an image.

Referring Expression Generation.

An initial approach (?) viewed the problem from the perspective of density estimation, in which the goal was to learn distributions over logical expressions identifying sets of objects in the world. Other research designed a comprehension-guided referring expression generator (?) by using a comprehension module trained on human-generated expressions to generate referring expressions.

Referring Expression Comprehension.

? (?) investigated referring expression comprehension to integrate contexts between objects. Later on, techniques such as Multiple Instance Learning (MIL) were used to explore context regions, with max-margin based MIL objective functions used for training. Further, ? (?) leveraged a natural language query of the object to localize a target object using a Spatial Context Recurrent Convnet (SCRC) model. It operates as a scoring function on candidate boxes for object retrieval, integrating spatial configurations and global scene-level contextual information. This explicit modeling of the referent and context region pairs has proven useful. Approaches such as compositional modular networks (?) analyzed referential expressions by identifying entities and relationships mentioned in the input expression and grounding them all in the scene. Such an approach has been shown to effectively inspect local regions and pairwise interactions between them. A modular approach was also explored in the Modular Attention Network (?), where three modular components related to subject appearance, location, and relationship to other objects were used. It has proven effective at focusing on the subjects and their relationships. Approaches such as GroundNet (?) have leveraged syntactic analysis of the input referring expression to build a dynamic computation graph of neural modules that defines an architecture for performing localization. Variational models have also been used for referential expression comprehension, where variational Bayesian methods called variational context (?) were used to solve the problem of complex context modeling. These methods have proven capable of exploiting the relation between the referent and context, thereby reducing the search space of context. Furthermore, an accumulated attention mechanism (?) has been proposed to accumulate the attention for useful information in image, query, and objects. It has demonstrated the ability to reduce the redundancy and noise issues present in other approaches. Recently, a Cross-Modal Relationship Extractor (CMRE) and a Gated Graph Convolutional Network (GGCN) were combined into a cross-modal relationship inference network (?). CMRE has been shown to highlight objects and relationships which have connections with a given referring expression, while GGCN computes multimodal semantic contexts by fusing information from different modes and propagating multimodal information through the structured relation graph. Coming from a perspective of natural language understanding, a Recursive Grounding Tree (?) sought to automatically compose a binary tree structure by parsing the referring expression, in order to perform visual reasoning along the tree in a bottom-up fashion. It has been shown to allow gradients from continuous score functions with a discrete tree construction. There has also been interest in combining visual reasoning with referential expressions through the creation of a new dataset (?). Most of the above approaches use bounding box localization, but object segmentation (?) has also been explored for referring expression comprehension.

Referring Expression Generation and Comprehension.

Only a few approaches have performed both generation and comprehension tasks. Visual context (?, ?) was initially used in referring expression models to find visual comparisons to other objects within an image, and it has shown significant improvements. Further, a unified framework (?) was designed using a speaker, a listener, and a reinforcer. The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions. Feedback from the discriminative reinforcer has proven capable of benefiting both tasks. The role of attributes (?) was also studied to show that they help in disambiguation when referring to a particular object.

4.1.2 Image Referring Expression - Datasets

For the task of image referring expression, both real and synthetic image datasets have been designed. In the following, we present the details of the datasets separately.

Real Images.

In the real and natural images category, the ImageCLEF (https://www.imageclef.org/SIAPRdata) and MSCOCO (see Section 3.1.2) datasets are commonly used for creating referring expression annotations. From a subset of ImageCLEF’s IAPR dataset, referring expressions are collected in a game-based setting, namely ReferItGame (?). The resulting dataset is called RefCLEF and its statistics can be found in Table 55.

Real Images | Distinct Objects | Referring Expressions | Train/Test Splits
19,894 | 96,654 | 130,525 | Per-Image split
Table 55: Statistics of the RefCLEF dataset.

The RefCOCO (https://github.com/lichengunc/refer), RefCOCO+ (?), and RefCOCOg (?) datasets were all created using MSCOCO images. For RefCOCO and RefCOCO+, the “People vs. Object” split evaluates images containing multiple people (Test A) and images containing multiple instances of all other objects (Test B). Both RefCOCO and RefCOCO+ were collected in the same interactive setting as above, ReferItGame (http://tamaraberg.com/referitgame) (?). Table 56 presents the statistics of the RefCOCO dataset, whereas Table 57 shows the statistics of the RefCOCO+ dataset.

Total Images | Objects | Referring Expressions | Train/Test Splits
19,994 | 50,000 | 142,209 | People vs. Object
Table 56: Statistics of the RefCOCO dataset.

One important distinction between the RefCOCO and RefCOCO+ datasets is that the latter was collected in a more restrictive setting than the former. Specifically, location words were not permitted in the referring expressions of RefCOCO+, whereas there was no such restriction on the language for RefCOCO.

Total Images | Objects | Referring Expressions | Train/Test Splits
19,992 | 49,856 | 141,564 | People vs. Object
Table 57: Statistics of the RefCOCO+ dataset.

To overcome some of the limitations of RefCLEF, a dataset based on MSCOCO was created. This dataset, known as RefCOCOg (?) (https://github.com/mjhucla/Google_Refexp_toolbox), contains much longer sentences and was collected in a non-interactive setting using AMT, in contrast to the interactive setting used for RefCLEF, RefCOCO, and RefCOCO+. The statistics of this dataset are presented in Table 58.

Total Images | Objects | Referring Expressions | Train/Test Splits
26,711 | 54,822 | 85,474 | Per-Object
Table 58: Statistics of the RefCOCOg dataset.

The referring expression datasets mentioned above use single sentences for image referring expression. In contrast, the GuessWhat dataset (?) (https://github.com/GuessWhatGame/guesswhat) was created with a cooperative two-player guessing game, the goal of which was to locate an unknown object in an image (collected from MSCOCO) by asking a sequence of questions. Hence, it provides multiple sentences (i.e., a dialogue) for a given image in order to perform referring expression. Another notable aspect of this dataset is that only images containing between 3 and 20 objects are chosen from MSCOCO. The dialogue collection was achieved via crowdsourcing using AMT. For evaluation, the dataset is randomly split into 70% for training, 15% for validation, and 15% for testing. Table 59 presents more details about the dataset.

Dataset Type Images Objects Dialogues Questions Words Vocab. Size
Full 66,537 134,073 155,280 821,889 3,986,192 11,465
Finished 65,112 125,349 144,434 732,081 3,540,497 10,985
Success 62,954 114,271 131,394 648,493 3,125,219 10,469
Table 59: Statistics of “GuessWhat” dataset. The row ‘Full’ means all the dialogues are included, ‘Finished’ means all finished dialogues (successful and unsuccessful) are included, and ‘Success’ means only successful dialogues are included.
Synthetic Images.

In the synthetic category, the CLEVR-Ref+ dataset (?) (https://cs.jhu.edu/~cxliu/2019/clevr-ref+.html) was introduced to address issues such as bias in datasets with real images, since it has recently been shown that referring expression models suffer from unintended biases (?). CLEVR-Ref+ reuses the images from the CLEVR dataset (see Section 5.2.2), while replacing the questions in CLEVR with referring expressions and the answers with referred objects. The main purpose of CLEVR-Ref+ is to diagnose image reasoning with referring expressions by exercising the desired control over the nature of samples. Table 60 presents the splits of the dataset.

Split Images Referring Expressions
Training 70,000 700,000
Validation 15,000 150,000
Test 15,000 150,000
Table 60: Splits of the CLEVR-Ref+ dataset.

4.1.3 Image Referring Expression - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different Image Referring Expression models and the results achieved by them.

Evaluation Measures.

The measure that is usually used for the evaluation of Image Referring Expression models is Prec@0.5, i.e., precision calculated using the Intersection over Union (IoU) ratio between the true and predicted bounding boxes, with 0.5 as the IoU threshold.
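
The IoU-based precision described above can be sketched in a few lines of Python. The sketch assumes axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates, one predicted box per expression, and the conventional IoU threshold of 0.5 with a prediction counted as correct when its IoU is at least the threshold; whether individual papers use a strict or non-strict inequality is not specified here. The function names and example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(predicted, ground_truth, threshold=0.5):
    """Percentage of predictions whose IoU with the true box is >= threshold."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need equally sized, non-empty box lists")
    correct = sum(
        1 for p, g in zip(predicted, ground_truth) if iou(p, g) >= threshold
    )
    return 100.0 * correct / len(predicted)


# Hypothetical predictions and ground-truth boxes for two expressions.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
truth = [(0, 0, 10, 8), (20, 20, 30, 30)]

print(iou(preds[0], truth[0]))         # 0.8 (overlap 80, union 100)
print(precision_at_iou(preds, truth))  # 50.0 (one of two predictions is correct)
```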

Models.

The models designed to approach the task of Image Referring Expression aim to optimize the Prec@0.5 measure by identifying the object in a visual input that matches the textual phrase. In Table 61, we present some exemplar architectures (refer to Combined column) created to address the task by integrating both image and language inputs. We also include a column that showcases the optimization techniques used to train those models.

Approach Image Language Combined Optimizer RL
 (?) VGG LSTM MMI SGD
 (?) VGG LSTM Neg. Bag SGD
 (?) VGG LSTM Context -
 (?) VGG BiLSTM CG ADAM
 (?) VGG LSTM Combined ADAM
 (?) VGG LSTM CMN -
 (?) VGG LSTM Reinforcer ADAM
 (?) VGG BiLSTM VarContext SGD
 (?) VGG LSTM AccumulateAtt SGD
 (?) VGG LSTM ParallelAtt ADAM
 (?) ResNet-101 BiLSTM MAttNet -
 (?) ResNet-101 BiLSTM RVG-Tree ADAM
 (?) ResNet-101 BiLSTM CMRIN ADAM
Table 61: Exemplar Image Referring Expression and Comprehension architectures.
Results.

Several models and datasets have been created to address the task of Image Referring Expression. These datasets provide variety in content, which helps improve the generalization ability of the models. In this section, we cover the results obtained by the models on some representative datasets. Table 62 and Table 63 present results obtained with a subset of models on the RefCOCO, RefCOCO+, and RefCOCOg datasets presented in Section 4.1.2.

RefCOCO
Model val testA testB
MMI (?) - 63.15 64.21
Neg. Bag (?) 76.90 75.60 78.00
Context (?) 76.18 74.39 77.30
CG (?) - 74.04 73.43
Attributes (?) - 78.85 78.07
CMN (?) - 75.94 79.57
Reinforcer (?) 79.56 78.95 80.22
VarContext (?) - 78.98 82.39
AccumulateAtt (?) 81.27 81.17 80.01
ParallelAtt (?) 81.67 80.81 81.32
MAttNet+ResNet-101 (?) 85.65 85.26 84.57
RVG-Tree+ResNet-101 (?) 83.48 82.52 82.90
CMRIN+ResNet-101 (?) 86.99 87.63 84.73

Table 62: Comparison of Prec@0.5 (%) scores of different methods on the RefCOCO dataset.
RefCOCO+ RefCOCOg
Model val testA testB val test
MMI (?) - 48.73 42.13 - -
Neg. Bag (?) - - - - 68.40
Context (?) 58.94 61.29 56.24 - -
CG (?) - 60.26 55.03 - -
Attributes (?) - 61.47 57.22 - -
CMN (?) - 59.29 59.34 - -
Reinforcer (?) 62.26 64.60 59.62 71.65 71.92
VarContext (?) - 62.56 62.90 - -
AccumulateAtt (?) 65.56 68.76 60.63 - -
ParallelAtt (?) 64.18 66.31 61.46 - -
MAttNet+ResNet-101 (?) 71.01 75.13 66.17 78.10 78.12
RVG-Tree+ResNet-101 (?) 68.86 70.21 65.49 76.82 75.20
CMRIN+ResNet-101 (?) 75.52 80.93 68.99 80.45 80.66
Table 63: Comparison of Prec@0.5 (%) scores of different methods on the RefCOCO+ and RefCOCOg datasets.

4.1.4 Image Referring Expression - Discussion

For Image Referring Expression, on all MSCOCO-based datasets (i.e., RefCOCO, RefCOCO+, and RefCOCOg), the technique proposed by ? (?) outperforms existing baselines. This approach builds a Cross-Modal Relationship Extractor (CMRE) to highlight objects and their relationships. Furthermore, a Gated Graph Convolutional Network (GGCN) is used to compute multimodal semantic contexts by fusing information from different modes and propagating multimodal information. This Cross-Modal Relationship Inference Network (CMRIN), combined with ResNet-101 visual features, has been shown to achieve the best results.

4.2 Video Referring Expression

In the following, we present more details about the Visual Referring Expression task in which a video is used as the visual input.

4.2.1 Video Referring Expression - Introduction

When compared to image referring expression, the task of video referring expression is less explored. There has been a surge in interest in tackling the spatio-temporal contexts and motion features that are inherent to videos. However, most of the work has thus far been concentrated only on one variant of image referring expression, i.e., comprehension. ? (?) used stereo videos to exploit richer and more realistic temporal-spatial contextual information along with gaze cues for referring expression comprehension. Figure 7 shows an example of the video referring expression comprehension.

Figure 7: Given a video (represented as a sequence of frames from ? (?)) and a referring expression, the Referring Expression Comprehension Model identifies the referred object in the video using bounding boxes.

Another approach by ? (?) explored Language Referring Expressions to point to the objects in the video to achieve object segmentation. Slightly different from the described task, ? (?) proposed an end-to-end boundary-aware model for video grounding. The model uses a lightweight branch to predict semantic boundaries corresponding to the given linguistic information. It aggregates contextual information by explicitly modeling the relationship between the current element and its neighbours.

4.2.2 Video Referring Expression - Datasets

In this section, we present the datasets used to evaluate the task of Video Referring Expression.

Object Referring in videos with Gaze (ORGaze).

For performing Video Referring Expression, the Cityscapes dataset (https://www.cityscapes-dataset.com), which contains a diverse set of stereo video sequences recorded in street scenes, was modified to include gaze information. The resulting ORGaze dataset (?) (https://people.ee.ethz.ch/~arunv/ORGaze.html) supports object referring in videos with language and human gaze. More details of the dataset are presented in Table 64.

Videos | Objects | Condition | Lighting | Annotations
5,000 | 30,000 | Urban | Daytime | Bounding Boxes, Gaze Recordings, Language Expression
Table 64: Statistics of the ORGaze dataset.

The authors split the cities in the training set of Cityscapes for training and validation, while using all the cities in the validation set of Cityscapes for testing. More concretely, the validation set is constructed by selecting one city (e.g., Zürich) from the Cityscapes training set, while the remaining cities form the training set. For constructing the test set, the videos from all the cities in the Cityscapes validation set (e.g., Frankfurt, Lindau, Münster) are used. Of the total 30,000 annotated objects, 80% were used for training and the remaining 20% were reserved for model evaluation.

4.2.3 Video Referring Expression - Evaluation Measures, Models, and Results

In this section, we review the evaluation measures used to benchmark different Video Referring Expression models and the results achieved by them.

Evaluation Measures.

The measure used to evaluate Video Referring Expression models is “Top-1 Accuracy”, reported separately for object proposals obtained with Language-based Object Proposals (LOP), Faster R-CNN (FRCNN), and EdgeBox (?).
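
A minimal Python sketch of top-1 accuracy over object proposals follows. It assumes that, for each referring expression, a model assigns a score to every candidate proposal and that the index of the proposal matching the referred object is known in advance; how that match is determined (for example, by overlap with the annotated box) is left outside the sketch. The function name and example scores are hypothetical.

```python
def top1_accuracy(scores_per_query, correct_indices):
    """Percentage of queries whose highest-scoring proposal is the correct one.

    scores_per_query: one list of proposal scores per referring expression.
    correct_indices:  index of the correct proposal for each expression.
    Ties are broken in favour of the proposal with the lowest index.
    """
    if not scores_per_query or len(scores_per_query) != len(correct_indices):
        raise ValueError("need equally sized, non-empty inputs")
    correct = 0
    for scores, gold in zip(scores_per_query, correct_indices):
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == gold:
            correct += 1
    return 100.0 * correct / len(correct_indices)


# Hypothetical scores for four expressions with three proposals each.
scores = [
    [0.1, 0.7, 0.2],    # top proposal: 1
    [0.5, 0.3, 0.2],    # top proposal: 0
    [0.2, 0.2, 0.6],    # top proposal: 2
    [0.9, 0.05, 0.05],  # top proposal: 0
]
gold = [1, 1, 2, 0]

print(top1_accuracy(scores, gold))  # 75.0 (three of four are correct)
```

Running the same function once per proposal source would yield one accuracy value per source, mirroring a table with one column per proposal method.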

Models.

Many models have been created to solve the task of Video Referring Expression. In Table 65, we present some exemplar architectures (refer to Combined column) created to address the task by integrating both video and language. We also include a column that showcases the optimization techniques used to train those models.

Approach Video Frame Language Combined Optimizer RL
 (?) - VGG LSTM WithGaze -
Table 65: Exemplar Video Referring Expression and Comprehension architectures.
Results.

As discussed earlier, several models have been created to approach the task of Video Referring Expression. In Table 66 we present results obtained with a subset of models built using the datasets presented earlier in Section 4.2.2.

Methods Edgebox FRCNN LOP
MNLM (?) - 23.954 32.418
VSEM (?) - 24.833 32.961
MCB (?) - 26.445 33.366
SimModel (?) 4.5 18.431 35.556
WithGaze (?) - 47.256 47.012
Table 66: Comparison of Top-1 Accuracy (%) of different methods on the ORGaze dataset.

4.2.4 Video Referring Expression - Discussion

The Video Referring Expression task is benchmarked using a single dataset. Evaluated using different task-specific metrics, the approach proposed by ? (?) which uses gaze information produces the best results.

5 Visual Question Answering, Reasoning, and Entailment

In this section, we explore three different tasks, namely, Visual Question Answering, Visual Reasoning, and Visual Entailment. The goals of these tasks differ; however, they share the common intention of answering questions conditioned on visual input. In the following, we elaborate on each of these tasks separately.

5.1 Visual Question Answering

The goal of Visual Question Answering (VQA) is to learn a model that comprehends visual content at both the global and local level for finding an association with pairs of questions and answers in the natural language form. The visual information for VQA includes both images and videos.

5.1.1 Image Question Answering - Introduction

The aim of Image Question Answering (Image Q&A) is to answer natural language questions about the contents of images. Earlier research efforts have focused on designing different algorithms and constructing datasets to address this challenge. The pioneering works (?, ?, ?) considered Image Q&A as a Visual Turing Test, where the expectation was to incorporate human-level abilities for semantically accessing the visual information to answer different questions. These were then improved as fill-in-the-blank tasks (?), where the system focused on multiple-choice question answering for images. The task was also expanded to address both multilingual settings (?) and automatic question generation, in which descriptive sentences are converted into questions (?). However, these efforts lacked the natural language questioning ability of humans. Hence, a broader task was proposed with the aim of addressing open-ended Image Q&A (?, ?), where the challenge was to ask a free-form natural language question about an image and have the system answer it. Figure 8 provides a schematic representation of the task, where a free-form question about the contents of an image is asked to obtain an answer.

Figure 8: Given an image and a question about the image, the Image Question Answering model produces an answer to it.

However, designing such a system poses several other challenges, such as coming up with strong baselines (?). To address these, binary image Q&A (?) was explored by providing complementary images for abstract scenes. These complementary images were used to provide visual verification of concepts contained in the questions. Some of the questions were understood as a loose, global association between Q&A sentences and images. Hence, more confined and dedicated tasks were created for relating local regions in the images (?) by addressing object-level grounding. Some approaches (?) concentrated only on counting objects in natural images.

Many methods have been proposed to address the challenging image Q&A task. The details of these methods are already covered in earlier surveys (?, ?). Therefore, we briefly present only the methods that were introduced after the publication of those surveys. Recent works aim at interpretability or explainability by overcoming priors (?), concentrating better on the image to extract relevant information (?), generating human-interpretable rules that provide better insights (?), and enforcing cycle-consistency (?), while other works try to understand the text inside an image in order to answer questions and reason about it (?). More recent works sought to incorporate outside knowledge (?) in the image Q&A framework to support real-world knowledge-aware question answering (?).

Different kinds of learning approaches are used for image Q&A, such as multi-task learning and federated learning. A multi-task learning approach (?) learns a vision-language representation that is shared by many tasks from their diverse datasets to address image Q&A. In contrast, federated learning is used with aimNet (?), which is validated in settings that include both horizontal and vertical federated learning. To address language priors, a modular language attention mechanism is used by ? (?) to parse a question into three phrase representations, namely type representation, object representation, and concept representation. This prevents language priors from dominating the answering process.

5.1.2 Image Question Answering - Datasets

Several datasets were created in the past decade to address the challenge of image question answering. In the following, we cover the datasets that are extensively used for this challenging task.

VQA v1.0.

VQA v1.0 (https://visualqa.org) (?) contains open-ended questions about images. These questions target different areas of an image, including background details and the underlying contexts. The answers are also open-ended and contain either a few words or a closed set of answers that can be provided in a multiple-choice format. Table 67 and Table 68 present the dataset splits for images with real and abstract scenes, respectively.

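| Dataset Split | Real Scenes | Questions per Image | Answers per Question | Textual Annotations: Questions | Textual Annotations: Answers |
| --- | --- | --- | --- | --- | --- |
| Training | 82,783 | 3 | 10 | 248,349 | 2,483,490 |
| Validation | 40,504 | 3 | 10 | 121,512 | 1,215,120 |
| Test | 81,434 | 3 | 10 | 244,302 | 2,443,020 |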
Table 67: Splits of the VQA v1.0 dataset with real scenes.
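
| Dataset Split | Abstract Scenes | Questions per Image | Answers per Question | Textual Annotations: Questions | Textual Annotations: Answers |
| --- | --- | --- | --- | --- | --- |
| Training | 20,000 | 3 | 10 | 60,000 | 600,000 |
| Validation | 10,000 | 3 | 10 | 30,000 | 300,000 |
| Test | 20,000 | 3 | 10 | 60,000 | 600,000 |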
Table 68: Splits of the VQA v1.0 dataset with abstract scenes.
VQA v2.0.

VQA v2.0 extends VQA v1.0 and has three parts: Balanced Real Images, Balanced Binary Abstract Scenes, and Abstract Scenes. Table 69 and Table 70 present the dataset splits for the balanced real images and the balanced binary abstract scenes, respectively. The abstract scenes in VQA v2.0 are the same as those of VQA v1.0.

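| Dataset Split | Real Images | Answers per Question | Textual Annotations: Questions | Textual Annotations: Answers | Complementary Pairs |
| --- | --- | --- | --- | --- | --- |
| Training | 82,783 | 10 | 443,757 | 4,437,570 | 200,394 |
| Validation | 40,504 | 10 | 214,354 | 2,143,540 | 95,144 |
| Test | 81,434 | 10 | 447,793 | 4,477,930 | - |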
Table 69: Splits of the VQA v2.0 dataset with balanced real images.

The term complementary pairs in Table 69 means that a given question is associated with a pair of similar images such that the answer differs depending on the image (i.e., two different answers).

| Dataset Split | Binary Abstract Scenes | Answers per Question | Textual Annotations: Questions | Textual Annotations: Answers |
| --- | --- | --- | --- | --- |
| Training | 20,629 | 10 | 22,055 | 220,550 |
| Validation | 10,696 | 10 | 11,328 | 113,280 |
Table 70: Splits of VQA v2.0 with balanced binary abstract scenes.
Outside Knowledge VQA (OK-VQA).

OK-VQA (https://okvqa.allenai.org) (?) uses a subset of MSCOCO (see Section 3.1.2) and is constructed with additional annotations such as questions, answers, and knowledge categories. Table 71 presents more details about the dataset, while Table 72 shows its splits.

| Total Images | Total Questions | Answers per Question | Unique Questions | Unique Answers | Unique Ques. Words | Total Categories | Average Ans. Length |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 14,031 | 14,055 | 5 | 12,591 | 14,454 | 7,178 | 10 + 1 | 1.3 |
Table 71: Statistics of the OK-VQA dataset.
Split Percent (%) Questions
Training 64 9,009
Test 36 5,046
Total 100 14,055
Table 72: Splits of the OK-VQA dataset.
Knowledge-aware VQA (KVQA).

The KVQA (http://malllabiisc.github.io/resources/kvqa) (?) dataset was designed to emphasize questions that require access to external knowledge. Table 73 presents more details about the dataset, while Table 74 shows its splits. In order to get a mean score, the KVQA dataset provides five such splits.

| Total Images | Q&A Pairs | Unique Named Entities | Unique Answers | Avg. Ques. Len | Avg. Ans. Len | Avg. number of Questions per Image |
| --- | --- | --- | --- | --- | --- | --- |
| 24,602 | 183,007 | 18,880 | 19,571 | 10.14 | 1.64 | 7.44 |
Table 73: Statistics of the KVQA dataset.
Split Percent (%) Images Q&A pairs
Training 70 17k 130k
Validation 20 5k 34k
Test 10 2k 19k
Table 74: Splits of the KVQA dataset.

5.1.3 Image Question Answering - Evaluation Measures, Models, and Results

In this section we describe only the evaluation measures used for Image Question Answering as Models, Results, and some Discussion are extensively presented in the recent surveys (?).

Evaluation Measures

For evaluating Image Q&A models, the Accuracy measure is used.
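
As a minimal sketch of how such an accuracy score can be computed, the snippet below shows two variants. The first is plain exact-match accuracy against a single reference answer. The second is a consensus-style score that uses the ten human answers collected per question (see Table 67) and counts a prediction as fully correct once at least three annotators gave it; this rule is commonly associated with the VQA benchmark, but the threshold of three, the `normalize` helper, and all function names are assumptions made for illustration and are not taken from this survey.

```python
from collections import Counter


def normalize(answer):
    """Hypothetical normalization: lowercase and strip surrounding whitespace."""
    return answer.strip().lower()


def plain_accuracy(predictions, references):
    """Fraction of predictions that exactly match their single reference answer."""
    if not references or len(predictions) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return correct / len(references)


def consensus_accuracy(prediction, human_answers, required_matches=3):
    """Score for one question: min(#annotators who gave this answer / 3, 1).

    required_matches=3 is an assumption about the benchmark's convention.
    """
    counts = Counter(normalize(a) for a in human_answers)
    return min(counts[normalize(prediction)] / required_matches, 1.0)
```

For example, a prediction of "yes" that matches only two of the ten human answers would receive a partial score of 2/3 under the consensus variant, whereas exact-match accuracy against a single majority answer "no" would count it as wrong.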

5.1.4 Video Question Answering - Introduction

The goal of Video Question Answering (Video Q&A) is to answer natural language questions about videos. Compared to Image Q&A, Video Q&A is less explored. Nevertheless, a few works have explored this spatio-temporal domain. One of the early attempts jointly parsed videos with the corresponding text to answer queries (?). Further, an open-ended Movie Q&A (?) with multiple-choice question pairs was designed to solve challenging questions that require semantic reasoning over a long temporal domain. Additionally, to limit the involvement of crowdworkers, the task was modified to use fill-in-the-blank questions (?, ?) that were automatically generated from different manually created video description datasets (Section 3.1.5). Other works (?) modified this dataset to support answering free-form natural language questions. Beyond this, open-ended video question answering has also been addressed with methods such as a spatio-temporal attentional encoder-decoder learning framework (?). There has also been interest in jointly addressing multiple tasks that handle video and language. High-level concept words (?) are detected so that they can be integrated with any video and language model addressing fill-in-the-blank and multiple-choice tests. Spatio-temporal reasoning from videos to answer questions has also been addressed by designing a spatial and temporal attention mechanism (?). Recently, owing to the growing interest in Video Q&A, six popular TV shows were used to create a dataset, similar to Movie Q&A, in which the questions are compositional (?). The TV Q&A dataset requires the proposed multi-stream models to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. Furthermore, spatio-temporal grounding (?) is employed to link depicted objects to visual concepts in questions and answers.
Figure 9 gives an example of this task, in which the model is given a video and a question and is asked to choose an answer from multiple choices.

Figure 9: Given a video (represented as a sequence of frames from the TV Q&A dataset) and a question, the Video Question Answering model finds the right answer from multiple options.

5.1.5 Video Question Answering - Datasets

Similar to image question answering, several datasets were created to address the challenge of video question answering. In the following, we cover those datasets that are popular and extensively used.

MovieQA.

The MovieQA (http://movieqa.cs.toronto.edu/home) (?) dataset is used to evaluate story comprehension of both video and text in an automatic manner. The dataset consists of almost 15,000 multiple-choice questions and answers obtained from over 400 highly diverse movies. Table 75 reports the statistics and splits of the dataset.

| Statistic | Training | Validation | Test | Total |
| --- | --- | --- | --- | --- |
| Movies with Plots and Subtitles | | | | |
| Movies | 269 | 56 | 83 | 408 |
| QA pairs | 9848 | 1958 | 3138 | 14944 |
| Q words | 9.3 | 9.3 | 9.5 | 9.3 ± 3.5 |
| CA. words | 5.7 | 5.4 | 5.4 | 5.6 ± 4.1 |
| Movies with Video Clips | | | | |
| Movies | 93 | 21 | 26 | 140 |
| QA pairs | 4318 | 886 | 1258 | 6462 |
| Video clips | 4385 | 1098 | 1288 | 6771 |
| Mean clip length | 201.0 s | 198.5 s | 211.4 s | 202.7 ± 216.2 s |
| Mean QA shots | 45.6 | 49.0 | 46.6 | 46.3 ± 57.1 |
Table 75: Statistics & Splits of the MovieQA dataset. The column ‘Total’ represents mean counts with standard deviations.
TVQA.

The TVQA (http://tvqa.cs.unc.edu) (?) dataset was created from videos of six different English TV shows, viz. Friends, The Big Bang Theory, How I Met Your Mother, House M.D., Grey’s Anatomy, and Castle. It consists of 460 hours of video, and the questions are designed to be compositional, expecting the models to comprehend subtitle-based dialogue and to recognize relevant visual concepts. Table 76 presents the statistics of the dataset, while Table 77 shows the splits.

| Video Clips | Video Clip Length | Q&A Pairs | Total Duration | Questions per Video Clip | Answers per Video Clip |
| --- | --- | --- | --- | --- | --- |
| 21,793 | 60 to 90 s | 152,545 | 460 h | 7 | 5 |
Table 76: Statistics of the TVQA dataset.

The testing data of TVQA is further split into two subsets named “test-public”, containing 7,623 Q&A pairs, and “test-reserved”, consisting of 7,630 Q&A pairs. The test-public set is available for the TVQA leaderboard (http://tvqa.cs.unc.edu/leaderboard.html), whereas test-reserved is preserved for future use.

Split Percent (%) Q&A pairs
Training 80 122,039
Validation 10 15,253
Test 10 15,253
Table 77: Splits of the TVQA dataset.

TVQA+ (http://tvqa.cs.unc.edu/download_tvqa_plus.html) (?) is an augmented subset of the original TVQA dataset where the augmentation comes in the form of bounding boxes linking depicted objects to visual concepts in both questions and answers. Table 78 presents the splits of the TVQA+ dataset.

| Split | Q&As | Clips | Avg. Span Length (s) | Avg. Video Length (s) | Annotated Images | Bounding Boxes | Categories |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Training | 23,545 | 3,364 | 7.20 | 61.49 | 118,930 | 249,236 | 2,281 |
| Validation | 3,017 | 431 | 7.26 | 61.48 | 15,350 | 32,682 | 769 |
| Test | 2,821 | 403 | 7.18 | 61.48 | 14,188 | 28,908 | 680 |
| Total | 29,383 | 4,198 | 7.20 | 61.49 | 148,468 | 310,826 | 2,527 |
Table 78: Splits of the TVQA+ dataset.

5.1.6 Video Question Answering - Evaluation Measures, Models and Results

In this section, we present the evaluation measures, models, and results achieved with various architectures of Video Q&A.

Evaluation Measures.

The Accuracy measure is used to evaluate Video Q&A models. Additionally, other measures are used: Temporal mean Intersection-over-Union (Temp. mIoU) (?); Answer-Span joint Accuracy (ASA), which jointly evaluates answer prediction and span prediction; and object grounding performance, calculated with mean Average Precision (Grd. mAP) (?).
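
To make the span-based measures more concrete, the following is a minimal sketch of how a temporal IoU and an answer-span joint accuracy could be computed. Spans are assumed to be `(start, end)` pairs in seconds, and the rule that a prediction counts for ASA only when its answer is correct and its span reaches an IoU of at least 0.5 is an assumption made here for illustration; the function names are hypothetical and do not come from any released evaluation script.

```python
def temporal_iou(pred_span, gt_span):
    """Intersection-over-Union of two (start, end) time spans."""
    pred_start, pred_end = pred_span
    gt_start, gt_end = gt_span
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def mean_temporal_iou(pred_spans, gt_spans):
    """Average temporal IoU over all questions (Temp. mIoU)."""
    ious = [temporal_iou(p, g) for p, g in zip(pred_spans, gt_spans)]
    return sum(ious) / len(ious)


def answer_span_accuracy(pred_answers, gt_answers, pred_spans, gt_spans,
                         iou_threshold=0.5):
    """Fraction of questions with a correct answer AND a well-localized span.

    iou_threshold=0.5 is an assumed convention, not stated in this survey.
    """
    hits = [
        answer == gold and temporal_iou(span, gold_span) >= iou_threshold
        for answer, gold, span, gold_span
        in zip(pred_answers, gt_answers, pred_spans, gt_spans)
    ]
    return sum(hits) / len(hits)
```

Under these assumptions, a model that picks the right answer but localizes the wrong moment is credited by Accuracy and penalized by ASA, which is why the two columns can diverge as in Table 81.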

Models.

The models which are created to address the task of Video Question Answering aim to provide overall understanding of the visual and the aligned textual content such as subtitles. In Table 79, we present some exemplar architectures (refer to Combined column) created to address the task by integrating both video and language. We also include a column that showcases the optimization techniques used to train those models.

Approach Video Frame Language Combined Optimizer RL
(?) C3D ResNet-152 LSTM ST-VQA ADAM
(?) - R-CNN+ResNet-101 BiLSTM Two-stream -
(?) - R-CNN+ResNet-101 BERT STAGE ADAM
Table 79: Exemplar Video Question Answering architectures.
Results.

Several models have been created to approach the task of Video Question Answering. In addition, many datasets have also been created to provide diversity in the content so that they boost the generalization ability of the models. In this section, we cover the results achieved by the models on some representative datasets. Table 80 and Table 81 present results obtained with a subset of models built using the TVQA and TVQA+ datasets presented in Section 5.1.5. Results for TVQA (http://tvqa.cs.unc.edu/leaderboard.html) and TVQA+ (https://competitions.codalab.org/competitions/22705#results) can also be found on the respective leaderboards.

Model Accuracy
Random 20.00
Retrieval-SkipThought 24.77
Longest Answer 30.22
NNS-SkipThought (Subtitle) 38.29
NNS-TFIDF (Subtitle) 50.79
Two-stream (Subtitle+Videos) (?) 66.36
Three-stream (Subtitle+Videos+Questions) (?) 68.48
Table 80: Accuracy attained on TVQA test (public) set. All models use timestamp annotation without which the scores achieved by them are lower.
Model Accuracy Grd. mAP Temp. mIOU ASA
ST-VQA (?) 48.28 - - -
Two-stream (?) 68.13 - - -
STAGE-LXMERT (?) 71.46 21.01 26.31 18.04
STAGE (?) 74.83 27.34 32.49 22.23
Human (?) 90.46 - - -
Table 81: Results obtained on TVQA+ test set.

5.1.7 Video Question Answering - Discussion

It has been observed from STAGE (?) that aligned fusion is essential for improving Video Q&A performance. STAGE uses all of the existing information such as Subtitles, Video, and Questions to build an efficient model. It has also proven to be effective if the models have access to the timestamp information as shown in Table 80.

5.2 Visual Reasoning

The goal of visual reasoning is to learn a model that comprehends the visual content by reasoning about it. Both images and videos are used as visual inputs for visual reasoning. In the following, we present more details about this complex and challenging task.

5.2.1 Image Reasoning - Introduction

The goal of image reasoning is to answer sophisticated queries by reasoning about the visual world. Initial efforts (?) aimed at designing diagnostic tests going beyond benchmarks such as VQA. They reduced the biases by having detailed annotations describing the kind of reasoning each question requires. It has also been observed that VQA models struggle when comparing the attributes of objects, or when novel attribute combinations need to be recognized (such as in compositional reasoning). A novel approach (?) used a program generator to construct an explicit representation of the reasoning process, and an execution engine to execute the resulting program, producing an answer. Then, end-to-end module networks (?) were proposed which learn to reason by directly predicting instance-specific network layouts without the aid of a parser as used in neural module networks. ? (?) went further and proposed Relation Networks (RNs) as a simple plug-and-play module to solve the problem of visual reasoning. RNs are further used to learn relation-aware visual features for content-based image retrieval (?) and also in Multi-Relational Networks (?). Furthermore, global context reasoning (?) is explored for better aligning image and language domains in diverse and unrestricted cases.

A recent approach (?) introduced a general-purpose conditioning method called Feature-wise Linear Modulation (FiLM) layers which influence neural network computation via a simple, feature-wise affine transformation based on conditioning information. FiLM was modified by ? (?) to generate parameters of FiLM layers going up the hierarchy of a convolutional network in a multi-hop fashion rather than all at once. Cascaded Mutual Modulation (CMM) (?) is an end-to-end visual reasoning model that also uses the FiLM technique to enable the textual and visual pipelines to mutually control each other. Another approach modified neural module networks (?) such that they perform compositional reasoning by automatically inducing a desired sub-task decomposition without relying on strong supervision. ? (?) proposed a set of visual-reasoning primitives which, when composed, manifest as a model capable of performing complex reasoning tasks in an explicitly interpretable manner. Also, in the context of interpretable learning frameworks, Learning-By-Asking (LBA) (?) attempted to closely mimic natural learning with the goal of making it more data efficient than the traditional VQA setting. Further, compositional attention networks (?) were designed as fully differentiable neural network architectures to facilitate explicit and expressive reasoning. The goal of this architecture is to provide a strong prior for iterative reasoning, allowing it to support structured learning, as well as to generalize from a modest amount of data.

Recently, neural-symbolic visual question answering (?) attempted to combine deep representation learning with symbolic program execution. It first recovers a structural scene representation from the image and a program trace from the question. This was extended with a Neuro-Symbolic Concept Learner (NS-CL) (?) that learns visual concepts, words, and semantic parsing of sentences without explicit supervision. It learns by simply looking at images and reading paired questions and answers. Further, a multimodal relational network (MuRel) (?) was proposed to learn end-to-end reasoning over real images. Additionally, ? (?) used spatial knowledge to aid visual reasoning. Their framework combined knowledge distillation, relational reasoning, and probabilistic logical languages. Existing diagnostic tests have been further modified with referring expressions to handle bias (?) and with structural, relational, and analogical reasoning in a hierarchical representation (?).

Explainable and explicit neural modules (?) have also been explored with scene graphs. Objects as nodes and pairwise relationships as edges were used for explainable and explicit reasoning with structured knowledge. Further expanding the scope of inquiry on this subject, ? (?, ?) exploit the compositional linguistic structure of complex questions by forming neural module networks which query about the abstract shapes observed in an image. Improvement is further seen in how images are interpreted. For example, compositional question answering (?) was addressed with scene graph structures on real-world images going beyond abstract shapes. Figure 10 demonstrates the task of reasoning about real-world images.

Figure 10: Given a real-world image and a question, the Image Reasoning Model reasons about the question to produce an answer.

Reasoning was also extended to cognition for understanding the information observed in images with commonsense reasoning (?), while the goal of NLVR (?) and NLVR2 (?) tasks is to determine whether a sentence is true about a visual input or not.
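
The feature-wise affine transformation behind FiLM, mentioned above, can be illustrated with a small sketch. Each channel of a visual feature map is scaled by a factor gamma and shifted by an offset beta, and both are predicted from the conditioning input (for visual reasoning, typically an encoding of the question). The version below uses plain Python lists and a single linear layer as the predictor; the helper names, the shapes, and the choice of a linear predictor are assumptions made for illustration and are not the original implementation.

```python
def linear(vector, weight_rows, bias):
    """Dense layer on plain lists: one output per weight row."""
    return [sum(v * w for v, w in zip(vector, row)) + b
            for row, b in zip(weight_rows, bias)]


def film(feature_maps, conditioning,
         gamma_weights, gamma_bias, beta_weights, beta_bias):
    """Apply FiLM to feature_maps shaped [channels][height][width].

    gamma and beta are predicted from the conditioning vector, then each
    channel c is transformed as gamma[c] * x + beta[c].
    """
    gammas = linear(conditioning, gamma_weights, gamma_bias)  # scale per channel
    betas = linear(conditioning, beta_weights, beta_bias)     # shift per channel
    if len(gammas) != len(feature_maps) or len(betas) != len(feature_maps):
        raise ValueError("need one (gamma, beta) pair per feature channel")
    return [
        [[g * value + b for value in row] for row in channel]
        for channel, g, b in zip(feature_maps, gammas, betas)
    ]
```

Because gamma and beta depend only on the conditioning vector, the same image features are modulated differently for different questions, which is the sense in which the language input influences the visual computation.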

5.2.2 Image Reasoning - Datasets

For image reasoning, both real and synthetic image datasets have been developed. In the following, we present the datasets belonging to both of these two categories.

Compositional Language and Elementary Visual Reasoning (CLEVR).

CLEVR (https://cs.stanford.edu/people/jcjohns/clevr) (?) is a diagnostic dataset created using a 3D computer graphics toolkit known as Blender (https://www.blender.org). It consists of synthetic images of simple 3D objects that vary in their attributes, viz. size, color, shape, and material. Images contain three to ten different combinations of these objects and attributes, arranged in different spatial positions. Such complex configurations require good visual reasoning capabilities from VQA models to produce correct answers. Table 82 presents the splits of the dataset.

Split Images Questions Unique Questions Overlap with train
Training 70,000 699,989 608,607 -
Validation 15,000 149,991 140,448 17,338
Test 15,000 149,988 140,352 17,335
Total 100,000 999,968 853,554 -
Table 82: Splits of the CLEVR dataset.
Natural Language Visual Reasoning (NLVR).

Cornell Natural Language for Visual Reasoning, dubbed NLVR (http://lil.nlp.cornell.edu/nlvr) (?), is a multimodal dataset that comes with natural language sentences grounded in synthetic images. The images are rendered and encapsulate different objects such as triangles, circles, and squares. These objects come in various sizes and are placed at different positions within images. The descriptions of the images were manually written by crowdworkers. Table 83 presents the official splits of the dataset for evaluation purposes.

Split Unique Sentences Examples
Training 3,163 74,460
Validation 267 5,940
Test-P 266 5,934
Test-U 266 5,910
Total 3,962 92,244
Table 83: Splits of the NLVR dataset. Test-P and Test-U denote Test set (public) and Test set (unreleased), respectively.
Natural Language Visual Reasoning for Real (NLVR2).

Limitations such as the limited expressivity and semantic diversity that arose from the synthetic nature of the NLVR dataset have been addressed in the next incarnation of NLVR, named Natural Language for Visual Reasoning for Real, NLVR2 (http://lil.nlp.cornell.edu/nlvr) (?). Similar to NLVR, the images in NLVR2 also come as a pair along with a grounded natural language description. Table 84 presents the official splits of the dataset.

Split Unique Sentences Examples
Training 23,671 86,373
Validation 2,018 6,982
Test-P 1,995 6,967
Test-U 1,996 6,970
Total 29,680 107,292
Table 84: Splits of the NLVR2 dataset. Test-P denotes Test set Public, whereas Test-U means Test set Unreleased.
CLEVR-CoGenT.

A modified version of CLEVR is the Compositional Generalization Test, CLEVR-CoGenT (https://cs.stanford.edu/people/jcjohns/clevr) (?). It is used to test a model's ability to recognize novel combinations of attributes at test time. There are two conditions in this dataset, viz. Condition A and Condition B; depending on the condition, the colors of the geometrical shapes vary as shown in Table 85. Based on these conditions, the CLEVR-CoGenT dataset is divided for evaluation purposes as shown in Table 86.

| Geometrical Shape | Condition | Colors of Geometrical Shape |
| --- | --- | --- |
| Cubes | A | gray, blue, brown, yellow |
| Cubes | B | red, green, purple, cyan |
| Cylinders | A | red, green, purple, cyan |
| Cylinders | B | gray, blue, brown, yellow |
| Spheres | A | any color |
| Spheres | B | any color |
Table 85: Conditions in the CLEVR-CoGenT dataset.
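
The color constraints in Table 85 can be written down as a small lookup, which makes the compositional shift between the two conditions explicit: cubes and cylinders swap palettes, while spheres are unconstrained. The sketch below is only an illustration of the table; the dictionary layout and the `is_allowed` helper are hypothetical and are not part of the dataset's tooling.

```python
PALETTE_1 = {"gray", "blue", "brown", "yellow"}
PALETTE_2 = {"red", "green", "purple", "cyan"}
ALL_COLORS = PALETTE_1 | PALETTE_2

# Allowed colors per condition and shape, transcribed from Table 85.
COGENT_COLORS = {
    "A": {"cube": PALETTE_1, "cylinder": PALETTE_2, "sphere": ALL_COLORS},
    "B": {"cube": PALETTE_2, "cylinder": PALETTE_1, "sphere": ALL_COLORS},
}


def is_allowed(shape, color, condition):
    """Return True if an object with this shape and color fits the condition."""
    if condition not in COGENT_COLORS:
        raise ValueError(f"unknown condition: {condition!r}")
    if shape not in COGENT_COLORS[condition]:
        raise ValueError(f"unknown shape: {shape!r}")
    return color in COGENT_COLORS[condition][shape]
```

A model trained only on Condition A never sees, for instance, a red cube, so evaluating it on Condition B probes whether shape and color have been learned as separable attributes.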
Split Condition Images Questions
Training A 70,000 699,960
Validation A 15,000 150,000
B 15,000 149,991
Test B 15,000 149,980
B 15,000 149,992
Table 86: Splits of the CLEVR-CoGenT dataset.
GQA.

The GQA (https://cs.stanford.edu/people/dorarad/gqa) (?) dataset was created to address the shortcomings in earlier VQA datasets. GQA consists of compositional questions over real-world images. Each image is associated with a scene graph of the image’s objects, attributes, and relations. Also, each question is associated with a structured representation of its semantics. Table 87 presents the statistics and splits of the dataset.

Images Questions Vocabulary Size Training Validation Testing Challenge
113,018 22,669,678 3,097 70% 10% 10% 10%
Table 87: Statistics & splits of the GQA dataset.
Relational and Analogical Visual rEasoNing (RAVEN).

The RAVEN (http://wellyzhang.github.io/project/raven.html) (?) dataset was designed to perform relational and analogical visual reasoning. It is built with Raven’s Progressive Matrices (RPM) (?) in mind. Furthermore, it associates vision with structural, relational, and analogical reasoning in a hierarchical representation. The dataset is split into training, validation, and testing sets in the ratio 6:2:2, respectively. Table 88 presents the statistics of the dataset.

| Images | RPM Problems | Tree-structure per problem | Structural Labels | Rule Annotations | Avg. rules per problem |
| --- | --- | --- | --- | --- | --- |
| 1,120,000 | 70,000 | 16 | 1,120,000 | 440,000 | 6.29 |
Table 88: Statistics of the RAVEN dataset.
Visual Commonsense Reasoning (VCR).

VCR (https://visualcommonsense.com) (?) is a large-scale dataset for achieving cognition-level visual understanding. It contains about 110k images, 290k multiple-choice questions, and correspondingly 290k correct answers and rationales. This dataset is very diverse and, consequently, challenging. Table 89 presents the official splits and some high-level statistics of the dataset.

Dataset Characteristic Train Validation Test
Number of questions 212,923 26,534 25,263
Number of answers per question 4 4 4
Number of rationales per question 4 4 4
Number of images 80,418 9,929 9,557
Number of movies covered 1,945 244 189
Average question length 6.61 6.63 6.58
Average answer length 7.54 7.65 7.55
Average rationale length 16.16 16.19 16.07
Average num. of objects mentioned 1.84 1.85 1.82
Table 89: High-level statistics of the VCR dataset. One fold in the dataset was held-out for blind evaluation at a later date. Hence, the statistics of that fold are not shown here.
Visual COMmonsense rEasoning in Time (Visual COMET).

Visual COMET (https://visualcomet.xyz) (?) is a large-scale dataset of Visual Commonsense Graphs for reasoning about the dynamic context of static images in order to achieve cognitive visual scene understanding. Visual COMET contains images with person grounding (i.e., multimodal co-reference chains), and the images are connected with inference sentences. Table 90 presents the official splits and more statistics about the dataset.

| Split | Images/Places | Events at Present | Inferences on Events Before | Inferences on Intents at Present | Inferences on Events After | Total Inferences |
| --- | --- | --- | --- | --- | --- | --- |
| Train | 47,595 | 111,796 | 467,025 | 237,608 | 469,430 | 1,174,063 |
| Dev | 5,973 | 13,768 | 58,773 | 28,904 | 58,665 | 146,332 |
| Test | 5,968 | 13,813 | 58,413 | 28,568 | 58,323 | 145,309 |
| Total | 59,356 | 139,377 | 584,211 | 295,080 | 586,418 | 1,465,704 |
Table 90: Statistics and splits of the Visual Commonsense Graph dataset.

5.2.3 Image Reasoning - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different models of Image Reasoning and the results obtained by them.

Evaluation Measures.

The standard evaluation measures such as Accuracy are used for evaluation. However, there are evaluation measures that are explicitly used for Image Reasoning (e.g., CLEVR), viz. Querying Attribute (QA) that uses questions to ask about an attribute of a particular object, Compare Attribute (CA) which uses comparison questions for asking whether two objects have the same value for some attribute, Compare Numbers (CN) which uses comparison questions to ask which of two object sets is larger, Count which asks counting questions to find the number of objects fulfilling some conditions, and Exist which asks existence questions to check whether a certain type of object is present or not.
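
Since these CLEVR-specific measures are accuracies restricted to one question type each, they can be obtained by grouping the evaluation records by type. The following is a minimal sketch under that reading; the record layout `(question_type, prediction, answer)` and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Compute per-question-type and overall accuracy.

    records: iterable of (question_type, prediction, answer) tuples, where
    question_type is a label such as "Count", "Exist", "CN", "QA", or "CA".
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for question_type, prediction, answer in records:
        totals[question_type] += 1
        correct[question_type] += int(prediction == answer)
    if not totals:
        raise ValueError("no records to evaluate")
    per_type = {t: correct[t] / totals[t] for t in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return per_type, overall
```

Note that the overall score computed this way weights every question equally, so it is not the plain average of the per-type scores unless all types have the same number of questions.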

Models.

The models that are designed to approach the task of Image Reasoning are built such that they provide an effective way of reasoning about vision with language as additional input. In Table 91, we present some exemplar architectures (refer to Combined column) created to address the task by integrating both image and language. We also include a column that showcases the optimization techniques used to train the Image Reasoning models.

Approach Image Language Combined Optimizer RL
 (?) ResNet-101 LSTM SA+MLP ADAM
 (?) VGG LSTM N2NMN ADAM
 (?) ResNet-101 LSTM PGEE ADAM
 (?) Custom LSTM RN ADAM
 (?) ResNet-101 BiLSTM ACMN ADAM
 (?) ResNet-101 GRU FiLM ADAM
 (?) ResNet-101 BiLSTM MAC ADAM
 (?) ResNet-101 - TbD ADAM
 (?) ResNet-152 LSTM FinalDestGraph ADAM
 (?) ResNet-101 LSTM LCGN ADAM
 (?) ResNet-34 BiGRU NS-CL -
Table 91: Exemplar Image Reasoning architectures. “Custom” - Own CNN architecture.
Results.

The models designed on different Image Reasoning datasets aim to achieve generalization. In this section, we cover the results achieved by the models on some representative datasets. Table 92, Table 93, Table 94, and Table 95 present results obtained with a subset of models built using datasets such as CLEVR, GQA, VCR, and RAVEN that were presented in Section 5.2.2. Results for the NLVR and NLVR2 tasks can be found on the respective leaderboards (http://lil.nlp.cornell.edu/nlvr/).

Model Count Exist CN QA CA Overall
CNN+LSTM+SA+MLP (?) 59.7 77.9 75.1 80.9 70.8 73.2
N2NMN+700KProgLabel (?) 68.5 85.7 84.9 90.0 88.7 83.7
PGEE+700KProgLabel (?) 92.7 97.1 98.7 98.1 98.9 96.9
CNN+LSTM+RN (?) 90.1 97.8 93.6 97.9 97.1 95.5
ACMN (?) 94.2 81.3 81.6 90.5 97.1 89.3
CNN+GRU+FiLM (?) 94.3 99.1 96.8 99.1 99.1 97.7
MAC (?) 97.2 99.5 99.4 99.3 99.5 98.9
TbD+700KProgLabel (?) 97.6 99.2 99.4 99.5 99.6 99.1
FinalDestGraph (?) 91.3 98.6 99.6 99.5 99.8 97.5
LCGN+single-hop (?) - - - - - 97.9
NS-CL (?) 98.2 98.8 99.0 99.3 99.1 98.9
Table 92: Comparison of different models on the CLEVR dataset.
Model val test-dev test
CNN+LSTM (?) 49.2 - 46.6
Bottom-up (?) 52.2 - 49.7
MAC (?) 57.5 - 54.1
LCGN+single-hop (?) 63.8 55.6 56.0
Table 93: Comparison of accuracy (%) scores of different methods on the validation (val), test-dev, and test splits of the GQA dataset.
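| Model | Q → A (val) | Q → A (test) | QA → R (val) | QA → R (test) | Q → AR (val) | Q → AR (test) |
| --- | --- | --- | --- | --- | --- | --- |
| R2C (?) | 63.8 | 65.1 | 67.2 | 67.3 | 43.1 | 44.0 |
| ViLBERT (?) | 72.4 | 73.3 | 74.5 | 74.6 | 54.0 | 54.8 |
| B2T2 (?) | 71.9 | 72.6 | 76.0 | 75.7 | 54.9 | 55.0 |
| VL-BERT (?) | 73.7 | 74.0 | 74.5 | 74.8 | 55.0 | 55.5 |
| Unicoder-VL (?) | 72.6 | 73.4 | 74.5 | 74.4 | 54.5 | 54.9 |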
Table 94: Comparison of accuracy (%) scores of different models on the validation (val) and test splits of the VCR dataset.
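| Model | Acc | 2x2 Grid | 3x3 Grid | L-R | U-D | O-IC | O-IG |
| --- | --- | --- | --- | --- | --- | --- | --- |
| WReNDRT (?) | 15.02 | 23.26 | 29.51 | 6.99 | 8.43 | 8.93 | 12.35 |
| ResNetDRT (?) | 59.56 | 46.53 | 50.40 | 65.82 | 67.11 | 69.09 | 60.11 |
| Human (?) | 84.41 | 81.82 | 79.55 | 86.36 | 81.81 | 86.36 | 81.81 |
| PerfectSolver | 100 | 100 | 100 | 100 | 100 | 100 | 100 |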
Table 95: Comparison of accuracy (%) scores of different models on the RAVEN dataset.

5.2.4 Image Reasoning - Discussion

The task of Image Reasoning has been studied using different types of datasets. Initially, a synthetic dataset, viz. CLEVR, was used. Later, real-world datasets like GQA were created for developing more complex vision and language integration models. Table 92 shows the results for the CLEVR dataset. The recently introduced Neuro-Symbolic Concept Learner (NS-CL) (?) reaches state-of-the-art results without explicit supervision on visual concepts, words, and semantic parsing of sentences. However, for real-world image datasets like GQA, the approach by ? (?), which creates Language-Conditioned Graph Networks (LCGN) that use different hops to effectively support relational reasoning, achieves the best results. Most of the best-performing approaches on the VCR task are pretrained and then fine-tuned, as shown in Table 94. The RAVEN dataset differs from both CLEVR and GQA as it depends only on the image input. We can observe from Table 95 that a perfect solver achieves 100% accuracy, while the approach introduced by ? (?) achieves reasonable performance.

5.2.5 Video Reasoning - Introduction

When compared to image reasoning, video reasoning is in its nascent stages and there is still no clearly defined goal. However, for video reasoning, a configurable visual question and answer dataset (COG) (?) was designed to parallel experiments in humans and animals. The goal of COG is to address problems related to visual and logical reasoning and memory. To be more concrete, the task is aimed at deducing the correct answer while taking into account the changes of the scene, i.e., from both spatial and temporal perspectives. Figure 11 demonstrates the task of temporal reasoning about synthetic 2D scenes resembling video input.

Figure 11: Given a video (represented as a sequence of synthetic 2D scenes (?)) and a question, the Video Reasoning Model reasons about the video to perform the task presented to it in the question.

Further, ? (?) addressed both image and video reasoning by introducing the concept of a question-based visual guide to constrain the potential solution space by learning an optimal traversal scheme. In their approach, the final destination nodes alone are used to produce the answers.

5.2.6 Video Reasoning - Datasets

There are not many datasets for video reasoning. One of the few examples is listed below.

Configurable Visual Question and Answer (COG).

COG (https://github.com/google/cog#datasets) (?) was created to parallel experiments in humans and animals. Table 96 presents the splits of the dataset.

Split | Total Examples | Examples per Task Family
Training | 10,000,320 | 227,280
Validation | 500,016 | 11,364
Test | 500,016 | 11,364
Table 96: Splits of the COG dataset.

5.2.7 Video Reasoning - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different models of Video Reasoning and the results obtained by them.

Evaluation Measures.

For Video Reasoning (e.g., COG), the evaluation measures take into account the changes of the scene and are reported for four different query types.

  • Pointing (Point), which uses questions that ask the model to point to a certain object.

  • Yes/No, which uses questions seeking a binary decision.

  • Conditional (Condit), which is composed of questions based on objects that need to fulfill certain conditions.

  • Attribute-related (Atts), which is composed of questions about certain attributes.
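As an illustration of how scores can be reported per query type, the following minimal sketch groups predictions by query type and computes an accuracy (%) for each group, plus a pooled score. It assumes that the measure is plain accuracy and that the pooled "All" score is taken over all examples together; the function name `accuracy_by_query_type` and the toy records are hypothetical and are not taken from COG.

```python
from collections import defaultdict


def accuracy_by_query_type(records):
    """Compute accuracy (%) per query type and over all examples.

    records: iterable of (query_type, predicted_answer, gold_answer) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for query_type, predicted, gold in records:
        total[query_type] += 1
        correct[query_type] += int(predicted == gold)

    scores = {q: 100.0 * correct[q] / total[q] for q in total}
    n = sum(total.values())
    # Assumption: "All" pools every example rather than averaging the groups.
    scores["All"] = 100.0 * sum(correct.values()) / n if n else 0.0
    return scores


# Toy, made-up predictions (hypothetical; not COG data).
toy_records = [
    ("Point", "obj_3", "obj_3"),
    ("Yes/No", "yes", "no"),
    ("Yes/No", "no", "no"),
    ("Condit", "red", "red"),
    ("Atts", "blue", "green"),
]
scores = accuracy_by_query_type(toy_records)
for name, value in scores.items():
    print(f"{name}: {value:.1f}")
```

Pooling and per-group averaging give different numbers when the query types are unevenly represented, so it is worth stating which one is reported.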

Models.

Many models have been created to approach the task of Video Reasoning. In Table 97, we present some exemplar architectures (refer to Combined column) created to address the task by integrating both video and language. We also include a column that showcases the optimization techniques used to train those models.

Approach Video Frame Language Combined Optimizer RL
 (?) - Custom LSTM WorkMemory ADAM
 (?) - ResNet-152 LSTM FinalDestGraph ADAM
Table 97: Exemplar Video Reasoning architectures.
Results.

As discussed earlier, several models have been created to approach the task of Video Reasoning. In Table 98, we present the results obtained with a subset of models built using the COG dataset presented in Section 5.2.6.

Model Atts Condit Point Yes/No All
WorkMemory (?) - - - - 93.7
QuestionNodes (?) 73.7 63.5 92.5 57.9 63.3
FinalDestGraph (?) 99.2 98.4 100.0 95.0 97.2
Table 98: Comparison of measures using different methods on the COG dataset.

5.2.8 Video Reasoning - Discussion

The results presented in Table 98 show that the recently proposed approach by ? (?) achieves the best results on different task-specific measures. This approach proposes a question-based visual guide, which constrains the potential solution space by learning an optimal traversal scheme.

5.3 Visual Entailment

The goal of the Visual Entailment task is to learn a model that predicts whether the visual content, possibly augmented with text, entails a given hypothesis. Both images and videos are used as visual inputs. In the following, we elaborate on the task, the datasets used, and the approaches proposed to tackle the problem.

5.3.1 Image Entailment - Introduction

Addressing the drawbacks of VQA and visual reasoning, which deal with similar objects and sentence structures, ? (?) initially proposed a visually-grounded version of the Textual Entailment task in which an image is added to the textual premise and hypothesis. The task was later refined by ? (?) to predict whether the image semantically entails the text, given image-sentence pairs, where the premise is defined by an image instead of a natural language sentence. Figure 12 summarizes the task, where an image as premise and a piece of text as hypothesis are used by the Image Entailment model to predict whether the hypothesis is an entailment, contradiction, or neutral.

Figure 12: Given an image premise and a natural language text as hypothesis, the Image Entailment Model predicts whether the hypothesis is an entailment, contradiction, or neutral by understanding the evidence present in the image.

5.3.2 Image Entailment - Datasets

Image entailment is studied using two different datasets. One dataset, Visually-grounded Natural Language Inference (V-SNLI) (?), extends Natural Language Inference with visual grounding, while the other extends the Flickr30K dataset (see Section 3.1.2) into a visual entailment dataset, SNLI-VE (https://github.com/necla-ml/SNLI-VE) (?). Table 99 and Table 100 present the splits of these two datasets, respectively.

Split Entailment Neutral Contradiction
Training 182,167 181,515 181,938
Validation 3,329 3,235 3,278
Test 3,368 3,219 3,237
V-SNLIhard Test 1,058 1,068 1,135
Table 99: Splits of the V-SNLI dataset.
Split Images Entailment Neutral Contradiction Vocab
Training 29,783 176,932 176,045 176,550 29,550
Validation 1,000 5,959 5,960 5,939 6,576
Test 1,000 5,973 5,964 5,964 6,592
Table 100: Splits of the SNLI-VE dataset.

5.3.3 Image Entailment - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different models of Image Entailment and the results obtained by them.

Evaluation Measures.

The Accuracy measure is used to evaluate Image Entailment models.

Models.

Two different models are created to approach the task of Image Entailment. In Table 101, we present some exemplar architectures (refer to Combined column) created to address the task. We also include a column that showcases the optimization techniques used to train those models.

Approach Image Language Combined Optimizer RL
 (?) VGG BiLSTM V-BiMPM ADAM
 (?) ResNet-101 GRU EVE-Image ADAM
Table 101: Exemplar Image Entailment architectures.
Results.

The Image Entailment models leverage both image and textual input representations to build an entailment pipeline. In Table 102, Table 103, and Table 104 we present results obtained with a subset of models that were built using the datasets presented in Section 5.3.2.

Model Contradiction Neutral Entailment Overall
Relation Network (?) 67.29 68.86 66.50 67.55
Bottom-up (?) 70.52 70.96 65.23 68.90
Top-Down (?) 69.72 69.33 71.86 70.3
Hypothesis Only (?) 67.60 67.71 64.83 66.71
EVE-ROI (?) 67.69 69.45 74.25 70.47
EVE-Image (?) 71.56 70.52 71.39 71.16
Table 102: Comparison of accuracies (%) of different models on the SNLI-VE dataset.
Model Contradiction Neutral Entailment Overall
Hypothesis Only (?) 66.29 66.36 72.65 68.49
LSTM (blind) (?) 79.7 76.79 87.71 81.49
V-LSTM (?) 71.39 68.06 87.14 75.70
BiMPM (?) 86.25 82.79 90.03 86.41
V-BiMPM (?) 87.53 82.91 90.38 86.99
Table 103: Comparison of accuracies (%) of different models on the V-SNLI dataset.
Model Contradiction Neutral Entailment Overall
Hypothesis Only (?) 25.29 20.22 31.28 25.57
LSTM (blind) (?) 60.79 50.19 72.12 60.99
V-LSTM (?) 46.34 32.02 69.09 49.03
BiMPM (?) 77.62 59.36 80.43 72.55
V-BiMPM (?) 76.12 63.67 81.38 73.75
Table 104: Comparison of accuracy (%) scores of different models on V-SNLIhard.

5.3.4 Image Entailment - Discussion

The task of Image Entailment was evaluated using two different datasets. Table 103 and Table 104 show results on V-SNLI in different settings. The approach proposed by ? (?), which creates a visually grounded Bilateral Multi-Perspective Matching model (V-BiMPM), achieves the best overall results. Similarly, the evaluation on SNLI-VE presented in Table 102 shows that the Explainable Visual Entailment (EVE) approach proposed by ? (?) achieves the best overall results.

5.3.5 Video Entailment - Introduction

Video entailment (?) aims to infer whether a natural language hypothesis is entailed or contradicted by a given video clip aligned with subtitle information. The videos contain diverse temporal dynamics, event shifts, and social interactions. Figure 13 summarizes the task: given a video clip with aligned subtitles as premise and a natural language hypothesis based on the video content, a video entailment model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.

Figure 13: Given a video along with aligned subtitles as premise and a paired natural language text as hypothesis, the goal of a Video Entailment Model is to predict whether the hypothesis is an entailment or contradiction, by understanding the evidence(s) observed in the video. Example modified from ? (?).

5.3.6 Video Entailment - Datasets

The Video Entailment task was introduced by ? (?), together with a large-scale dataset called VIdeO-and-Language INference (VIOLIN) (https://github.com/jimmy646/violin). Detailed statistics of the dataset are presented in Table 105.

Video Source (TV Show/Movie Clips) | Num. of Episodes | Num. of Clips | Avg. Clip Len | Avg. Pos. Stmnt Len | Avg. Neg. Stmnt Len | Avg. Subtitle Len
Friends | 234 | 2,676 | 32.89s | 17.94 | 17.85 | 72.80
Desperate Housewives | 180 | 3,466 | 32.56s | 17.79 | 17.81 | 69.19
How I Met Your Mother | 207 | 1,944 | 31.64s | 18.08 | 18.06 | 76.78
Modern Family | 210 | 1,917 | 32.04s | 18.52 | 18.20 | 98.50
MovieClips | 5,885 | 5,885 | 40.00s | 17.79 | 17.81 | 69.20
All | 6,716 | 15,887 | 35.20s | 18.10 | 18.04 | 76.40
Table 105: Statistics of different video sources in the VIOLIN dataset.

For training and model evaluation purposes, the VIOLIN dataset is split into training, validation, and test splits in the ratio of 8:1:1. The exact number of triplet instances in each of the splits is shown in Table 106.

Split | Number of Videos (V) | Number of Hypotheses (H) | Number of Triplets (V, S, H)
Training | 12,687 | 76,122 | 76,122
Validation | 1,600 | 9,600 | 9,600
Testing | 1,600 | 9,600 | 9,600
Total | 15,887 | 95,322 | 95,322
Table 106: Splits of the VIOLIN dataset (V: Video, S: Subtitle, H: Hypothesis).
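The figures in Table 106 can be cross-checked with a few lines of arithmetic. The sketch below only uses the numbers printed in the table; the variable names are illustrative. It shows that the splits add up to the reported totals, that the split shares are close to (but not exactly) 8:1:1, and that every split contains six hypotheses per video.

```python
# Numbers copied from Table 106: (videos, hypotheses) per split.
splits = {
    "Training": (12_687, 76_122),
    "Validation": (1_600, 9_600),
    "Testing": (1_600, 9_600),
}

total_videos = sum(videos for videos, _ in splits.values())
total_hypotheses = sum(hyps for _, hyps in splits.values())

# Share of videos and hypotheses-per-video ratio for each split.
shares = {name: videos / total_videos for name, (videos, _) in splits.items()}
per_video = {name: hyps / videos for name, (videos, hyps) in splits.items()}

print(total_videos, total_hypotheses)
for name in splits:
    print(f"{name}: share={shares[name]:.3f}, hypotheses/video={per_video[name]:.1f}")
```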

5.3.7 Video Entailment - Evaluation Measures, Models, and Results

In this section, we present the evaluation measures, models, and results achieved with various architectures introduced for solving the Video Entailment task.

Evaluation Measures.

The Accuracy measure is used to evaluate Video Entailment models.

Models.

Very few models have been created to approach the task of Video Entailment. The Video Entailment models vary in the types of textual content they use, such as subtitles and statements. In Table 107, we present some exemplar architectures (refer to Combined column) created to address the task by integrating both video and language inputs. We also include a column that showcases the optimization techniques used to train those models.

Approach Video Frame Language Combined Optimizer RL
(?) - Detection Feat BERT SSV ADAM
Table 107: Exemplar Video Entailment architectures. SSV - Statement+Subtitles+Visual.
Results.

The few models that have been designed to approach the task of Video Entailment use different types of textual content aligned with the video. In Table 108, we present results obtained with a subset of models built using the VIOLIN dataset presented in Section 5.3.6. For building textual or visual representations, models such as SSV have also used pretrained vision and language integration models such as LXMERT (?).

Model Visual Text Accuracy
Statement (?) - BERT 54.20
Statement+Visual (?) Detection Feat BERT 59.45
Statement+Subtitles (?) - BERT 66.05
SSV (?) LXMERT LXMERT 66.25
SSV (?) Detection Feat BERT 67.84
Table 108: Comparison of accuracies (%) of different methods on the VIOLIN dataset.

5.3.8 Video Entailment - Discussion

The task of Video Entailment was evaluated using the VIOLIN dataset, and the recently proposed method by ? (?) has shown that combining multi-source information from different types of data, such as Statements, Subtitles, and Visual features, is useful for building a robust model (67.84% accuracy for SSV versus 54.20% for the statement-only model in Table 108). In addition, textual features generated using contextualized word embedding models are effective as well.

6 Visual Dialog

In this section, we explore the task of Visual Dialog. The objective of visual dialog is different from that of the previous tasks and involves a complex interaction between a human and an artificial agent.

6.1 Image Dialog

In the following, we present more details about Visual Dialog in which an image is used as the visual input.

6.1.1 Image Dialog - Introduction

The goal of the image dialog task is to create AI agents that can hold a dialog with humans in a natural language of choice about visual content (?) represented by an image. To be more specific, given an image, a history of dialogs, and a question about the image, the goal of the AI agent is to ground the question in the image, infer context from the history, and answer the question accurately. The problem can also be viewed from a different perspective, where the goal of the system is to locate an unknown object in the image by asking a sequence of questions (?) or to hold natural-sounding conversations about a shared image (?). Figure 14 summarizes the task.

Figure 14: Given an image, a question, and the dialog history, the Image Dialog Model generates an answer based on them.
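To make the three inputs of the task concrete, here is a minimal sketch of how a single Image Dialog example could be represented. The class name `ImageDialogExample`, its fields, and the flattened "Q: ... A: ..." context format are assumptions made for illustration; they are not the format of any particular dataset or model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ImageDialogExample:
    """One hypothetical Image Dialog input: image, dialog history, question."""

    image_path: str                      # where the image is stored
    question: str                        # the current question about the image
    history: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer) turns

    def context(self) -> str:
        """Flatten the history and the current question into one text prompt."""
        turns = [f"Q: {q} A: {a}" for q, a in self.history]
        return " ".join(turns + [f"Q: {self.question} A:"])


example = ImageDialogExample(
    image_path="img.jpg",
    question="What color is it?",
    history=[("Is there a dog?", "yes")],
)
print(example.context())
```

The pronoun "it" in the current question can only be resolved through the history, which is why the models discussed in this section condition on previous turns.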

Further, a standard agent can be extended into a question bot and an answer bot that cooperate with each other to guess images (?). To counter generic responses in dialog generation, knowledge transfer from dialog generation was explored with a discriminative dialog module trained to rank a list of candidate human responses (?). However, other approaches constrained themselves to specific domains and proposed end-to-end optimization schemes (?). ? (?) introduced an attentive memory that exploits past visual attention to resolve the current reference. Recently, reinforcement learning and Generative Adversarial Networks (GANs) were also used to generate more human-like responses to questions in image-based dialog (?). Dialog can also be seen from the perspective of a system that asks questions, which demonstrates how a visual dialog can be generated from discriminative question generation and answering (?). Furthermore, co-reference resolution was also investigated (?) to bridge the gap between nouns and pronouns using modules that perform explicit and grounded co-reference resolution at the word level. Recently, a novel attention mechanism called recursive visual attention (?) was proposed to resolve visual co-reference for visual dialog by browsing the dialog history. Another approach (?) formalized the task as inference in a graphical model with partially observed nodes and unknown graph structures, i.e., relations in dialog. Further, ? (?) extended the one-stage solution to a two-stage solution by building an image-question-answer synergistic network to value the role of the answer for precise visual dialog. Other novel approaches (?) were also designed in which a visually-grounded encoder was employed to synergize between guessing and asking questions. Further, a cooperative learning regime was followed to improve the accuracy.

6.1.2 Image Dialog - Datasets

For addressing the task of image dialog, several datasets have been created. In the following, we elaborate on each of them separately.

VisDial.

For Image Dialog, there exist two versions of this dataset, VisDial v0.9 and VisDial v1.0 (https://visualdialog.org/data) (?). VisDial was created using the MSCOCO dataset. For VisDial v0.9, the data is divided only into training and validation sets. Table 109 and Table 110 present details about the splits of VisDial v0.9 and VisDial v1.0, respectively.

Split Images Questions Answers Dialog Turns
Training 82,783 827,830 827,830 10
Validation 40,504 405,040 405,040 10
Test - - - -
Table 109: Splits of the VisDial v0.9 dataset.
Split Images Questions Answers Dialog Turns
Training 123,287 1,232,870 1,232,870 10
Validation 2,064 20,640 20,640 10
Test 8,000 80,000 80,000 1
Table 110: Splits of the VisDial v1.0 dataset.
CLEVR-Dialog.

The CLEVR-Dialog (https://github.com/satwikkottur/clevr-dialog) (?) dataset was developed for studying multi-round reasoning in visual dialog. The dialog grammar is grounded in the scene graphs of the CLEVR dataset (Section 5.2.2), originally developed for reasoning about images. Table 111 provides statistics of the dataset, while Table 112 shows dataset splits.

CLEVR Images | Total Dialogs | Total Questions | Unique Questions | Unique Answers | Vocabulary Size | Dialog Turns | Mean Ques. Length
85k | 425k | 4.25M | 73k | 29 | 125 | 10 | 10.6
Table 111: Statistics of the CLEVR-Dialog dataset.
Split Images Q&A Pairs Instances Dialog Rounds
Training 70,000 3.5M 5 10
Validation 15,000 0.75M 5 10
Test - - - -
Table 112: Splits of the CLEVR-Dialog dataset.

6.1.3 Image Dialog - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different models of Image Dialog and the results achieved by these models.

Evaluation Measures.

To evaluate the Image Dialog models, the Retrieval metrics presented in Section 3.1.3 are used.
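The retrieval measures reported in the result tables of this section (MRR, R@k, and Mean) can all be derived from the rank that a model assigns to the ground-truth answer among the candidate answers. The following minimal sketch assumes the standard definitions of these measures (Section 3.1.3 is not reproduced here) and 1-based ranks; the function name `retrieval_metrics` and the toy ranks are hypothetical.

```python
def retrieval_metrics(gt_ranks, ks=(1, 5, 10)):
    """Compute MRR, mean rank, and recall@k (%) from ground-truth ranks.

    gt_ranks: list of 1-based ranks of the ground-truth answer, one per question.
    """
    n = len(gt_ranks)
    if n == 0:
        raise ValueError("gt_ranks must not be empty")

    metrics = {
        "MRR": sum(1.0 / rank for rank in gt_ranks) / n,   # higher is better
        "Mean": sum(gt_ranks) / n,                         # lower is better
    }
    for k in ks:
        # Percentage of questions whose ground truth is ranked within the top k.
        metrics[f"R@{k}"] = 100.0 * sum(rank <= k for rank in gt_ranks) / n
    return metrics


# Toy, made-up ranks for five questions (hypothetical; not VisDial data).
toy_ranks = [1, 2, 4, 10, 20]
metrics = retrieval_metrics(toy_ranks)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that MRR and R@k reward higher values, whereas a lower mean rank is better, which is why the Mean column in the tables moves in the opposite direction to the other columns.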

Models.

The models created to approach the Image Dialog task continuously process a stream of image and textual dialog information. In Table 113, we present some exemplar architectures (refer to Combined column) designed to integrate both image and textual dialog to address the task. We also include a column that showcases the optimization techniques used to train such models.

Approach Image Language Combined Optimizer RL
 (?) VGG LSTM MemoryNetwork ADAM
 (?) VGG LSTM HCIAE-NP-ATT ADAM
 (?) VGG LSTM AMEM ADAM
 (?) VGG LSTM SF ADAM
 (?) ResNet-152 LSTM CorefNMN -
 (?) VGG LSTM CoAtt-GAN ADAM
 (?) ResNet-152 LSTM RvA ADAM
 (?) VGG LSTM GNN ADAM
 (?) ResNet-101 LSTM Synergistic ADAM
Table 113: Exemplar Image Dialog Architectures (Discriminative and Generative).
Results.

The models created for the Image Dialog task aim to build a system that effectively handles the complexity of the task. Several approaches have been used to build models with different versions of the same dataset. However, a few approaches share some commonalities, such as the usage of Memory Networks (?). Table 114 and Table 115 present the results obtained with a subset of discriminative and generative models, respectively, built using the VisDial v0.9 dataset, while Table 116 presents the results obtained with a subset of discriminative models built using the VisDial v1.0 dataset presented earlier in Section 6.1.2.

Model MRR R@1 R@5 R@10 Mean
LF (?) 0.5807 43.82 74.68 84.07 5.78
HRE (?) 0.5846 44.67 74.50 84.22 5.72
HREA (?) 0.5868 44.82 74.81 84.36 5.66
MN (?) 0.5965 45.55 76.22 85.37 5.46
HCIAE-NP-ATT (?) 0.6222 48.48 78.75 87.59 4.81
AMEM (?) 0.6227 48.53 78.66 87.43 4.86
CoAtt (?) 0.6398 50.29 80.71 88.81 4.47
SF (?) 0.6242 48.55 78.96 87.75 4.70
SCA (?) 0.6398 50.29 80.71 88.81 4.47
CorefNMN (?) 0.641 50.92 80.18 88.81 4.45
GNN (?) 0.6285 48.95 79.65 88.36 4.57
RvA (?) 0.6634 52.71 82.97 90.73 3.93
Table 114: Results of different discriminative models on the validation split of the VisDial v0.9 dataset.
Model MRR R@1 R@5 R@10 Mean
LF (?) 0.5199 41.83 61.78 67.59 17.07
HRE (?) 0.5237 42.29 62.18 67.92 17.07
HREA (?) 0.5242 42.28 62.33 68.17 16.79
MN (?) 0.5259 42.29 62.85 68.88 17.06
HCIAE-NP-ATT (?) 0.5386 44.06 63.55 69.24 16.01
CorefNMN (?) 0.535 43.66 63.54 69.93 15.69
CoAtt (?) 0.5411 44.32 63.82 69.75 16.47
CoAtt-RL (?) 0.5578 46.10 65.69 71.74 14.43
RvA (?) 0.5543 45.37 65.27 72.97 10.71
Table 115: Results of different generative models on the validation split of the VisDial v0.9 dataset.

6.1.4 Image Dialog - Discussion

For the Image Dialog task, two versions of the same dataset were used for evaluation. Both versions were evaluated with retrieval metrics. Nevertheless, the methods that achieve state-of-the-art performance on the two versions differ. On the VisDial v0.9 dataset, the Recursive Visual Attention (RvA) approach proposed by ? (?) achieves the best results among the discriminative models (Table 114). Among the generative models (Table 115), it achieves the best R@10 and mean rank, while CoAtt-RL obtains slightly higher MRR, R@1, and R@5. RvA refines the visual attention recursively by browsing through the dialog history until the agent has sufficient confidence in its visual co-reference resolution. This has also been shown to generate interpretable attention maps without additional annotations. For the VisDial v1.0 dataset, the results presented in Table 116 show that Synergistic-ensemble by ? (?) outperforms RvA.

Model MRR R@1 R@5 R@10 Mean NDCG
LF (?) 0.5542 40.95 72.45 82.83 5.95 0.4531
LF-att (?) 0.5707 42.08 74.83 85.05 5.59 0.4976
HRE (?) 0.5416 39.93 70.45 81.50 6.41 0.4546
MN (?) 0.5549 40.98 72.30 83.30 5.92 0.4750
MN-att (?) 0.5690 42.43 74.00 84.35 5.59 0.4958
CorefNMN (?) 0.615 47.55 78.10 88.80 4.40 0.547
GNN (?) 0.6137 47.33 77.98 87.83 4.57 0.5282
RvA (?) 0.6303 49.03 80.40 89.83 4.18 0.5559
Synergistic-ensemble (?) 0.6342 49.30 80.77 90.68 3.97 0.5788
Table 116: Results of different discriminative models on the test-standard split of the VisDial v1.0 dataset.

6.2 Video Dialog

In the following, we present more details about Visual Dialog in which a video is used as the visual input.

6.2.1 Video Dialog - Introduction

The aim of video dialog is to leverage scene information containing both audio (which can be transcribed as subtitles) and visual frames to hold a dialog with humans in a natural language of choice about the content (?, ?). A successful system is expected to ground concepts from the question in the video while leveraging contextual cues from the dialog history. Figure 15 summarizes the task.

Figure 15: Given a video (represented as a sequence of frames), a question, and the dialog history, the Video Dialog Model generates answers based on this information.

Several approaches have been proposed to address the task, where initially multimodal attention-based video description features were used to improve dialog (?). Further, a novel baseline (?) analyzed components such as data representation, extraction, attention, and answer generation in order to show that there can be relative improvements as compared to other approaches.

6.2.2 Video Dialog - Datasets

Audio Visual Scene-Aware Dialog (AVSD) (https://video-dialog.com) (?) was created for the Scene-Aware Dialog Challenge, in which the agent grounds its responses on the dynamic scene, the audio, and the history (previous rounds) of the dialog. Table 117 presents statistics and splits of the AVSD dataset.

Split Dialogs Turns Words
Training 7,985 123,480 1,163,969
Validation 1,863 14,680 138,314
Test 1,968 14,660 138,790
Table 117: Splits of the AVSD dataset.

6.2.3 Video Dialog - Evaluation Measures, Models, and Results

In this section, we review the evaluation measures used to benchmark different models of Video Dialog and the results obtained by these models.

Evaluation Measures.

To evaluate the Video Dialog models, the Language metrics presented in Section 3.1.3 (BLEU, METEOR, and CIDEr) are used.

Models.

Only a few models have been proposed to approach the task of Video Dialog. These models aim to capture the temporal aspect of a video and incorporate it in the textual dialog. In Table 118, we present some exemplar architectures (refer to Combined column) designed to address the task by integrating both video and language inputs. We also include a column that showcases the optimization techniques used to train those models.

Approach Video Frame Language Combined Optimizer RL
 (?) I3D VGG LSTM MultimodalAtt ADAM
 (?) I3D VGG LSTM i3d-rgb-spatial-10 ADAM
Table 118: Exemplar Video Dialog architectures.
Results.

As discussed earlier, only a few models have been created to approach the task of Video Dialog. In Table 119, we present the results obtained with those models built using the AVSD dataset presented earlier in Section 6.2.2.

Model B-1 B-2 B-3 B-4 METEOR CIDEr
Att-base (?) 0.273 0.173 0.117 0.084 0.117 0.766
Att-weightshare (?) 0.293 0.191 0.133 0.097 0.127 0.923
i3d-rgb-spatial-10 (?) 0.290 0.190 0.133 0.097 0.127 0.928
Att-base-beam (?) 0.285 0.187 0.131 0.096 0.128 0.941
Table 119: Results of different models on the “AVSD” dataset.

6.2.4 Video Dialog - Discussion

The Video Dialog task is evaluated with the AVSD dataset. Different strategies have been explored to fuse the language and video features to create a strong baseline. In particular, the approach used by ? (?), which uses beam search and an attention mechanism (i.e., Att-base-beam) over different modalities, achieves the best METEOR and CIDEr scores in Table 119, while Att-weightshare obtains the highest B-1 and B-2 scores.

7 Multimodal Machine Translation

In this section, we explore the task of Multimodal Machine Translation (MMT). The goal of this task is to translate natural language sentences that describe visual content (e.g. image) in a source language into a target language by taking the visual content as an additional input to the source language sentences.

7.1 Machine Translation with Image

In the following, we elaborate on Multimodal Machine Translation by considering image as the only visual input.

7.1.1 Machine Translation with Image - Introduction

The aim of MMT (?, ?, ?, ?) is to translate sentences that describe an image in a source language into a target language. However, for any given image, the description can be written in different source languages, resulting in multiple source language descriptions. This situation makes it possible to define different variants of the MMT task. The first variant is a single source translation task, in which the image description in a single source language is translated into a target language with additional cues from the corresponding image. Figure 16 summarizes this variant, where an image is accompanied by its description in English, which the model needs to translate into a description in German.

Figure 16: Given an Image and its description in a source language (e.g. En), the Image-guided Machine Translation model produces a description in a target language (e.g. De).

The second variant is a target language description generation task with additional source language cues, i.e., multiple source language descriptions of the same image, termed multisource MMT. Figure 17 summarizes this variant, where an image is accompanied by its descriptions in English (en), French (fr), and Czech (cs), which are all used to generate the German (de) translation.

Figure 17: Given an Image and its description in multiple source languages (e.g. en, fr, cs), the Multisource Image-guided Machine Translation model produces a description in a target language (e.g. de).

Different approaches have been proposed to handle single source MMT by associating visual and textual features with multimodal attention (?). Further, a novel approach was proposed in which a doubly-attentive decoder incorporates visual features to bridge the gap between image description and translation (?). In a similar vein, global visual features were incorporated in an attention-based multimodal NMT (?). This is achieved by attending to source-language words and parts of an image independently by means of two separate attention mechanisms. The MMT task can also be solved using two sub-tasks, learning to translate and learning visually grounded representations (?), both combined in a multi-task learning framework. Further, an advanced multimodal compact bilinear pooling method (?, ?) has also been used for MMT, in which the outer product of two vectors combines the attention features of the two modalities. Another model (?) used a shared visual-language embedding and a translator for learning. This joint model leverages a visual attention grounding mechanism that links the visual semantics with the corresponding textual semantics. Due to the presence of large multimodal data on the web, noisy image captions have also been tried for MMT (?). A latent variable model (?) has also been attempted, in which the latent variable can be seen as a multimodal stochastic embedding of an image and its description in a foreign language. MMT models have also been used in an adversarial setting. ? (?) found that even in the presence of visual features from unrelated images there is no significant performance degradation. Due to the recent success of unsupervised machine translation (?), there is also a growing interest in extending it to unsupervised MMT (?). Other studies (?) have countered criticism of MMT by showing that, under limited textual context, MMT models are capable of leveraging the visual input to generate better translations. Regarding multisource models, ? (?) explored MMT using neural multi-source sequence-to-sequence learning.
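The idea of attending to source-language words and to image regions with two separate attention mechanisms can be sketched in a few lines. This is a simplified illustration and not the implementation of any of the cited models: it uses plain dot-product scoring in place of learned attention parameters, assumes that the decoder state, source states, and image region features share the same dimensionality, and simply concatenates the two context vectors. All function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def attend(query, keys):
    """Dot-product attention of one query vector over a list of key vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return weights, context


def doubly_attentive_context(decoder_state, source_states, image_regions):
    """Attend to source words and image regions independently, then concatenate."""
    src_weights, src_context = attend(decoder_state, source_states)
    img_weights, img_context = attend(decoder_state, image_regions)
    return src_context + img_context, src_weights, img_weights


# Toy, made-up vectors (hypothetical): 2 source words, 3 image regions.
decoder_state = [1.0, 0.0]
source_states = [[1.0, 0.0], [0.0, 1.0]]
image_regions = [[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]

context, src_weights, img_weights = doubly_attentive_context(
    decoder_state, source_states, image_regions
)
print(context, src_weights, img_weights)
```

Keeping the two attention distributions separate means that each modality gets its own normalized weights, so a strong match on the textual side does not suppress the visual context, and vice versa.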

7.1.2 Machine Translation with Image - Datasets

The main dataset used with the models above (Section 7.1.1) is the Multi30k-MMT (https://www.statmt.org/wmt18/multimodal-task.html) dataset (?), which extends the Flickr30k dataset. Along with English, it contains human-translated German, French, and Czech sentences. The splits of this dataset can be found in Table 120.

Split Images Captions
Training 29,000 29,000
Validation 1,014 1,014
Test 1,000 1,000
Table 120: Splits of Multi30k-MMT for English, German, French, and Czech.

7.1.3 Machine Translation with Image - Evaluation Measures, Models, and Results

In this section, we review the evaluation measures used to benchmark different models of Machine Translation with Image and the results obtained by these models.

Evaluation Measures.

To evaluate Machine Translation with Image models, the Language metrics presented in Section 3.1.3 (BLEU and METEOR) are used.

Models.

Several models have been created for the task of Machine Translation with Image. The aim of these models is to tackle translation using either a single or multiple language textual sources along with an image. In Table 121, we present some exemplar architectures (refer to Combined column) which integrate both image and language to address the task. We also include an “Optimizer” column that indicates the optimization techniques used to train those models.

Approach Image Language Combined Optimizer RL
 (?) ResNet-50 BiGRU DoubleAtt Adadelta
 (?) VGG BiGRU GVF Adadelta
 (?) Inception-V3* BiGRU Imagination ADAM
 (?) ResNet-50 BiGRU Lium-cvc-ensemble ADAM
 (?) ResNet-50 BiGRU VMMTF ADAM
 (?) ResNet-50 LSTM CUNI-ensemble ADAM
Table 121: Exemplar Machine Translation with Image architectures. * - compares with ResNet-50 and VGG also.
Results.

In Table 122 and Table 123 we present the results obtained with a subset of models built using the Multi30k-MMT dataset presented earlier in Section 7.1.2.

Model | Measure | en→de | en→fr | en→cs
DoubleAtt (?) | BLEU | 36.5 | - | -
DoubleAtt (?) | METEOR | 55.0 | - | -
GVF (?) | BLEU | 37.3 | - | -
GVF (?) | METEOR | 55.1 | - | -
Imagination (?) | BLEU | 36.8 | - | -
Imagination (?) | METEOR | 55.8 | - | -
Lium-cvc-ensemble (?) | BLEU | 41.0 | 56.7 | -
Lium-cvc-ensemble (?) | METEOR | 60.5 | 73.0 | -
VMMTF (?) | BLEU | 37.6 | - | -
VMMTF (?) | METEOR | 56.0 | - | -
CUNI-ensemble (?) | BLEU | 42.6 | 62.8 | 35.9
CUNI-ensemble (?) | METEOR | 59.4 | 77.0 | 32.7
Table 122: Machine Translation with Image on the Multi30k test set [2016 (en→de), 2017 (en→fr), 2018 (en→cs)].
Model | Measure | en→de | en→fr | en→cs
CUNI-single (?) | BLEU | 32.5 | 40.6 | 31.8
CUNI-single (?) | METEOR | 52.3 | 61.0 | 30.6
MeMAD (?) | BLEU | 38.5 | 44.1 | -
MeMAD (?) | METEOR | 56.6 | 64.3 | -
Table 123: Machine Translation with Image on the Multi30k test set [2018 (en→de, en→fr, en→cs)].

7.1.4 Machine Translation with Image - Discussion

This task is evaluated using only one dataset, namely Multi30k-MMT, containing descriptions in three source languages and one target language. Results presented in Table 122 and Table 123 refer to the shared tasks proposed in different years. We can observe that, depending on the year of the test set release, different approaches outperform the baseline methods.

7.2 Machine Translation with Video

In the following, we present more details about Multimodal Machine Translation by using the video as the visual input.

7.2.1 Machine Translation with Video - Introduction

The goal of video-guided machine translation (?) is to translate a source language description into the target language using the video information as additional spatio-temporal context. Figure 18 summarizes this approach, where a video is accompanied by an English description that is to be translated into a German description.

Figure 18: Video-guided Machine Translation.

7.2.2 Machine Translation with Video - Datasets

The VATEX (?) dataset (http://vatex.org/main/index.html) was created for English and Chinese to support machine translation with video as well as multilingual video description generation. Table 124 presents more details about the dataset.

Split Videos Action Label
Training 25,991 ✓
Validation 3,000 ✓
Public Test 6,000 -
Secret Test 6,278 -
Table 124: Splits of the VATEX dataset. Secret Test denotes human-annotated captions held out for organizing challenges.

7.2.3 Machine Translation with Video - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different models of Machine Translation with Video and the results obtained by them.

Evaluation Measures.

To evaluate the Machine Translation with Video models, the Language metrics presented in Section 3.1.3 are used.

Models.

Very few models have been created to investigate the task of Machine Translation with Video. The temporal aspect of a video is crucial for providing effective translations. In contrast to Machine Translation with Image, the task of Machine Translation with Video only has models that are built using a single textual source. In Table 125, we present some exemplar architectures (refer to the Combined column) which integrate both video and language inputs for addressing the task. We also include a column that showcases the optimization techniques used to train those models.

Approach Video Frame Language Combined Optimizer RL
 (?) I3D - LSTM NMT+LSTM VI ADAM
Table 125: Exemplar Machine Translation with Video architectures.
Results.

The models created to address the task of Machine Translation with Video are built using a single dataset, namely VATEX. In Table 126, we present results obtained with a subset of models built using the VATEX dataset presented earlier in Section 7.2.2.

Model B-4 METEOR
NMT+LSTM VI (?) [English→Chinese] 30.20 -
NMT+LSTM VI (?) [Chinese→English] 27.18 -
Table 126: Comparison of different methods on the VATEX dataset.

7.2.4 Machine Translation with Video - Discussion

In Table 126, we observe that only one method utilizing LSTM with video features from the pretrained I3D model (i.e., NMT+LSTM VI) is evaluated using the language metrics on the challenging VATEX dataset for both English and Chinese.

8 Language-to-Vision Generation

In this section, we explore the task of Language-to-Vision Generation. The goal of this task is to generate visual content from its natural language description. Different variations of the task exist and are discussed in the following.

8.1 Language-to-Image Generation

In the following, we present more details about Language-to-Image Generation, in which an image is the visual content to be generated.

8.1.1 Language-to-Image Generation - Introduction

Different variations of Language-to-Image Generation exist. For example, image generation can also be framed as the manipulation of an existing image, which allows a new image to be produced from a desired natural language description. We present some variations in the following.

Sentence-level Language-to-Image Generation.

The goal is to generate images conditioned on the natural language descriptions. It is considered as a fundamental problem in many applications. The success of Generative Adversarial Networks (GANs) (?) has made possible the generation of interesting images of specific categories, such as room interiors, album covers, and faces (?). This has led to an interest in bridging the gap between natural language text and image modeling. Figure 19 shows that the natural language description is used to generate an image with a Text-to-Image Generation Model.

Figure 19: Given a natural language description, the Language-to-Image Model generates an image conditioned on the provided description.

Initially, alignDRAW (?) was introduced to iteratively draw patches on a canvas while attending to the relevant words in the description. Further, it was shown that visual concepts could be translated from characters to pixels (?) with a conditional GAN. This was further improved by taking instructions about what content should be drawn in which location in order to achieve high-quality image generation (?). Models which were developed to condition on classes for image generation (?) have also been used to generate images. However, the quality of images generated is much lower than when not conditioning on classes. Very close to this approach is the Text-conditioned Auxiliary Classifier GAN (TAC-GAN) (?), which conditions image generation on both the sentence and the class information and has been shown to improve structural coherence.

To generate images with high resolution, several GANs were stacked together, yielding StackGAN (?, ?), which used a global sentence representation. This helped generate images of different sizes. To overcome the bottleneck of a global sentence-level representation, an attention-based GAN, AttGAN (?), was used to capture fine-grained details at different sub-regions of the image by attending to the relevant words in the natural language description. In other research efforts, a hierarchical approach (?) was taken by inferring the semantic layout of the image. Instead of learning a direct mapping from description to image, the generation process is decomposed into multiple steps: first, a layout generator constructs a semantic layout from the text, and then an image generator converts the layout into an image. Other approaches such as HDGAN (?) add hierarchical adversarial objectives inside the network to regularize mid-level representations and assist generator training in capturing complex image information. This has been shown to generate high-resolution images.

Later, instead of dealing with natural language descriptions, ? (?) used image-specific scene graphs, enabling explicit reasoning about objects and their relationships. Further, to obtain better high-resolution images, coarse-resolution features were taken as input and the Perceptual Pyramid Adversarial Network (PPAN) was introduced to directly synthesize multi-scale images conditioned on text in an adversarial way (?). Another approach named MirrorGAN (?) targets both visual realism and semantic consistency when generating images from text. It proposes a global-local attention and semantics-preserving framework in which the image generated from the text is in turn used to regenerate the text. This has been shown to semantically align the given text with the regenerated description. In the following, we explore some related ideas that expand the scope of language-to-image generation.

Image Manipulation.

Image manipulation takes a different path from the image generation approaches discussed above. TAGAN (?) was introduced to generate semantically manipulated images while preserving text-irrelevant content. Here, the generator learns to produce images in which only the regions that correspond to the given text are modified. Another interesting approach is an interactive system that generates an image iteratively. Recent approaches (?) used attention in both the generator and the discriminator, while others (?) have designed error correction modules to rectify mismatched attributes and complete missing content in the generated image. There are also other variations where the source image is manipulated via natural language dialogue (?).

Fine-grained Image Generation.

Fine-grained image generation uses a recurrent image generation model (?) that takes into account both the output generated up to the current step and all past instructions. This has been shown to add new objects, apply simple transformations to existing objects, and correct previous mistakes. Earlier research did not concentrate on fine-grained generation of images, i.e., localizing objects. Recently, control over the location of individual objects within an image was made possible (?) by adding a pathway to both the generator and the discriminator and applying it iteratively at the locations specified by the bounding boxes.

Sequential Image Generation.

The sequential image generation approach StoryGAN (?), based on a sequential conditional GAN, generates a sequence of images for a given multi-sentence paragraph. Termed story visualization, the task is the inverse of image storytelling, and StoryGAN has been shown to generate high-quality images while also achieving contextual consistency.

8.1.2 Language-to-Image Generation - Datasets

For image generation, existing image datasets have been modified to accommodate image descriptions. Initially, the Oxford-102 (http://www.robots.ox.ac.uk/~vgg/data/flowers/102) and Caltech-UCSD Birds (CUB) (http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) datasets, consisting of flower and bird images belonging to 102 and 200 classes respectively, were expanded with image descriptions (?). Table 127 and Table 128 present the splits of these datasets.

Split Images Captions per Image Total Captions
Training 5,878 10 58,780
Validation 1,156 10 11,560
Test 1,155 10 11,550
Total 8,189 10 81,890
Table 127: Splits of the Oxford-102 dataset with image descriptions.
Split Images Captions per Image Total Captions
Training 8,855 10 88,550
Validation - - -
Test 2,933 10 29,330
Total 11,788 10 117,880
Table 128: Splits of the CUB dataset with image descriptions.

Similarly, the MSCOCO dataset (see Section 3.1.2) is also used for the reversed task of description generation, i.e., given a description, generate the image matching the description. We represent this dataset as MSCOCO-Gen. Table 129 presents the splits of the dataset.

Split Images Captions per Image Total Captions
Training 82,783 5 413,915
Validation - - -
Test 40,504 5 202,520
Total 123,287 5 616,435
Table 129: Splits of the MSCOCO-Gen dataset.

8.1.3 Language-to-Image Generation - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different models of Language-to-Image Generation and the results obtained by them.

Evaluation Measures.

Several evaluation measures are used specifically for Language-to-Image Generation; they are discussed in detail in the following.

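  • Inception Score (IS) (?) was initially proposed to compare the quality of images generated by GAN models. A pretrained Inception-v3 model (?) is applied to each generated image to obtain its conditional label distribution, which should have low entropy for a recognizable image, while the marginal label distribution over all generated images should have high entropy for a diverse set. The score is the exponential of the average KL divergence between the conditional and the marginal distributions. The same idea is applied to images generated from text descriptions for automatic evaluation. Higher scores are better for IS.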

  • Fréchet Inception Distance (FID) (?) aims to improve on IS by comparing the statistics of generated samples with those of original samples, instead of evaluating generated samples in isolation. It also depends on the Inception-v3 model: activations of its pool3 layer are extracted for both the generated and the original samples, and the Fréchet distance between Gaussians fitted to the two sets of activations is reported. Lower FID is better, as it corresponds to generated samples that are more similar to the original ones.

  • R-precision is borrowed from the evaluation of ranked retrieval results. It is used as a complementary evaluation metric for language-to-image generation. Specifically, generated images are used to query their corresponding natural language descriptions, and the metric measures how many relevant descriptions are retrieved among the top-ranked candidates.
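As a rough illustration of two of the measures above, the following sketch computes IS from per-image label distributions and R-precision from an image-to-description similarity matrix. It is a minimal sketch, not the evaluation code of any cited work: the function names, the input layout, and the use of precomputed probabilities and similarity scores (rather than running Inception-v3 or a text-image encoder) are assumptions made for illustration.

```python
import math

def inception_score(label_probs):
    """label_probs: one class distribution p(y|x) per generated image (assumed precomputed)."""
    n = len(label_probs)
    num_classes = len(label_probs[0])
    # Marginal label distribution p(y), averaged over all generated images.
    marginal = [sum(p[k] for p in label_probs) / n for k in range(num_classes)]
    # Average KL(p(y|x) || p(y)); zero-probability terms contribute nothing.
    mean_kl = 0.0
    for p in label_probs:
        mean_kl += sum(pk * math.log(pk / mk) for pk, mk in zip(p, marginal) if pk > 0)
    mean_kl /= n
    return math.exp(mean_kl)

def r_precision(similarities, relevant, r=1):
    """similarities[i][j]: score between generated image i and candidate description j.
    relevant[i]: set of indices of the ground-truth descriptions for image i."""
    total = 0.0
    for scores, rel in zip(similarities, relevant):
        # Rank candidate descriptions by decreasing similarity and keep the top r.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        total += sum(1 for j in ranked[:r] if j in rel) / r
    return total / len(similarities)
```

With confident and evenly spread predictions over two classes, `inception_score([[1.0, 0.0], [0.0, 1.0]])` reaches the maximum of 2, whereas uniform predictions give the minimum of 1. FID is omitted here because it additionally requires a matrix square root of the feature covariances.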

Models.

Many models have been created to approach the task of Language-to-Image Generation. In Table 130, we present some exemplar architectures (refer to Combined column) that integrate both image and language for addressing the task. We also include a column that showcases the optimization techniques used to train those models.

Approach Image Language Combined Optimizer RL
 (?) - char-CNN-RNN GAN-INT-CLS ADAM
 (?) - char-CNN-GRU GAWWN ADAM
 (?) - - StackGAN ADAM
 (?) Inception-v3 BiLSTM AttGAN -
 (?) - BiLSTM MirrorGAN -
Table 130: Exemplar Language-to-Image Generation architectures.
Results.

In Table 131, Table 132, and Table 133 we present results obtained with a subset of models built using the CUB, Oxford-102, and COCO datasets presented earlier in Section 8.1.2.

Model Resolution IS FID HR
GAN-INT-CLS (?) 64x64 2.88 ± .04 68.79 2.76 ± .01
GAWWN (?) 64x64 3.10 ± .03 53.51 -
GAWWN (?) 128x128 3.62 ± .07 72.65 1.95 ± .02
StackGAN (?) 64x64 3.02 ± .03 35.11 -
StackGAN (?) 256x256 3.70 ± .04 51.89 1.29 ± .02
StackGAN++ (?) 256x256 4.04 ± .05 15.30 1.19 ± .02
AttGAN (?) 256x256 4.36 ± .03 - -
MirrorGAN (?) 256x256 4.56 ± .05 - -
Table 131: Comparison of different methods using generated images of different resolutions on the “CUB” dataset. R-precision (%) for 256x256 with AttGAN (53.31) and MirrorGAN (57.67). HR - Human Ranking.
Model Resolution IS FID HR
GAN-INT-CLS (?) 64x64 2.66 ± .03 79.55 1.84 ± .02
StackGAN (?) 64x64 2.73 ± .03 43.02 -
StackGAN (?) 256x256 3.20 ± .01 55.28 1.16 ± .02
StackGAN++ (?) 256x256 3.26 ± .01 48.68 1.30 ± .03
Table 132: Comparison of different methods using generated images of different resolutions on the “Oxford-102” dataset.
Model Resolution IS FID HR
GAN-INT-CLS (?) 64x64 7.88 ± .07 60.62 1.82 ± .03
StackGAN (?) 64x64 8.35 ± .11 33.88 -
StackGAN (?) 256x256 8.45 ± .03 74.05 1.18 ± .03
StackGAN++ (?) 256x256 8.30 ± .10 81.59 1.55 ± .05
PPGN (?) 256x256 9.58 ± .21 - -
AttGAN (?) 256x256 25.89 ± .47 - -
MirrorGAN (?) 256x256 26.47 ± .41 - -
Table 133: Comparison of different methods using generated images of different resolutions on the “COCO” dataset. R-precision (%) for 256x256 with AttGAN (72.13) and MirrorGAN (74.52).

8.1.4 Language-to-Image Generation - Discussion

The Language-to-Image Generation task has been evaluated using three different datasets. The CUB and Oxford-102 datasets contain only one visual object per image, while COCO has multiple objects. Several methods based on modified GAN objectives have been proposed for generating an image from a given textual description. From Table 131, Table 132, and Table 133, we observe that the recent MirrorGAN (?) achieves the highest IS and R-precision at 256x256 resolution on CUB and COCO. It is built on the idea of back-translating the generated image into text. For Oxford-102, StackGAN++ (?) achieves the highest IS.

8.2 Language-to-Video Generation

In the following, we present more details about Language-to-Video Generation, in which a video is the visual content to be generated.

8.2.1 Language-to-Video Generation - Introduction

The goal of Language-to-Video Generation is to extend language-to-image generation with the temporal aspect. However, language-to-video generation requires a stronger conditional generator than is generally required for language-to-image generation because of the increase in dimensionality. To address this challenge, a conditional generative model is trained (?) that combines variational autoencoders (VAE) (?) with a GAN and extracts both static and dynamic information from text. Figure 20 shows a natural language description being used to generate a video with a text-to-video generation model.

Figure 20: Given a natural language description, the Language-to-Video model generates a video (represented as a sequence of frames from ? (?)) conditioned on the description.

Another novel approach is to generate video from script. The composition, retrieval, and fusion network (Craft) model (?) is capable of learning knowledge from the video-description data and applying it in generating videos from novel captions. It has been shown that the Craft model performs better than the direct pixel generation approaches and generalizes well to unseen captions and to video databases with no text annotations.

8.2.2 Language-to-Video Generation - Datasets

For video generation there are no publicly available datasets. However, ? (?) have collected the Text2Video dataset belonging to ten different categories of YouTube videos, each ranging between 10-400 seconds for language-to-video generation. The categories of videos are biking in snow, playing hockey, jogging, playing soccer, playing football, kite surfing, playing golf, swimming, sailing and water skiing. For the purposes of model evaluation, the dataset is split into training, validation, and test sets in the ratio of 7:1:2 respectively, the details of which can be found in Table 134.

Split Videos
Training 2800
Validation 400
Test 800
Table 134: Splits of Text2Video (Combines all categories).
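The 7:1:2 ratio above can be checked with a short sketch. The helper name, the fixed seed, and the shuffling step are assumptions for illustration; the source does not state how videos were assigned to splits.

```python
import random

def split_7_1_2(items, seed=0):
    """Split items into training, validation, and test parts in a 7:1:2 ratio."""
    items = list(items)
    # Shuffling with a fixed seed is an assumption made for reproducibility.
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

Applied to 4,000 items, this yields parts of size 2,800, 400, and 800, matching the counts in Table 134.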

8.2.3 Language-to-Video Generation - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different models of Language-to-Video Generation and the results obtained by them.

Evaluation Measures.

The Accuracy measure is used to evaluate the Language-to-Video Generation models.

Models.

Only a limited set of models have been created so far to handle the task of Language-to-Video Generation. In Table 135, we present some exemplar architectures (refer to Combined column) which integrate both video and language to address the task. We also include a column that showcases the optimization techniques used to train those models.

Approach Video Frame Language Combined Optimizer RL
 (?) MotionFeatures - LSTM T2V ADAM
Table 135: Exemplar Language-to-Video Generation architectures.
Results.

In Table 136, we present results obtained with a subset of models built using the Text2Video dataset presented earlier in Section 8.2.2.

Model Accuracy
DT2V-baseline (?) 0.101
PT2V (?) 0.134
GT2V (?) 0.192
T2V (?) 0.426
Table 136: Comparison of accuracy scores of different models on Text2Video.

8.2.4 Language-to-Video Generation - Discussion

The task of Language-to-Video Generation is not as well explored as the Language-to-Image Generation task due to its complexity. The results presented in Table 136 show that the T2V approach proposed by ? (?) achieves the best accuracy. The accuracy is calculated using a simple video classifier, which is a five-layer neural network with 3D full convolutions and ReLU nonlinearities.

9 Vision-and-Language Navigation

In this section, we explore the task of Vision-and-Language Navigation. The goal of this task is to carry out navigation in an environment by interpreting natural language instructions.

9.1 Image-and-Language Navigation

In the following, we present more details about the Image-and-Language Navigation task in which photorealistic images forming 3D environments are used as the visual input.

9.1.1 Image-and-Language Navigation - Introduction

Most of the attempts at Vision-and-Language Navigation (VLN) use photorealistic images forming 3D environments. The goal of the Image-and-Language Navigation (ILN) task is to enable an agent or a robot to carry out navigation in an environment defined by the photo-realistic image views by means of interpreting natural language instructions (?). This requires the agent/robot to simultaneously process both vision and language inputs and navigate from a source to a target location. Figure 21 summarizes the task.

Figure 21: Given an image and a few language instructions (represented with a sequence of images from ? (?)), the Image-and-Language Navigation model is expected to carry out the navigation of an agent in an environment (indicated by arrows).

Initially, sequence-to-sequence models were proposed to address the task, among which the student forcing approach achieved promising results in previously explored environments. One approach (?) integrated a module that combines model-based and model-free reinforcement learning techniques to better generalize to unseen environments. There is also the reinforced cross-modal matching approach (?), which enforces both local and global cross-modal grounding via reinforcement learning. ILN can also be seen as a search on a navigation graph (?) with a progress monitor as a learnable heuristic for the search. It is improved by leveraging a visual-textual co-grounding attention mechanism to better align the instructions and visual scenes, and by incorporating a progress monitor to estimate the agent’s current progress towards the goal (?). Another substantial improvement came from training an action space with an embedded speaker model (?): new instructions are synthesized for data augmentation, and pragmatic reasoning is used to evaluate how well candidate action sequences explain an instruction. Improving over earlier approaches that make local action decisions or score entire trajectories using beam search, the FAST framework (?) balances local and global signals when exploring the environment, allowing it to act greedily but use global signals to backtrack when necessary. Also, ? (?) explore a generalizable navigational agent by training it in two stages. In the first stage, imitation and reinforcement learning are mixed, while in the second stage, fine-tuning is performed via newly introduced “unseen” triplets.

ILN can also be seen as a form of visual question answering (see Section 5.1) that requires navigation to answer questions. Embodied Question Answering (?, ?) is explored with an agent that is spawned at a random location in a 3D environment and asked a question. To answer the question, the agent navigates through the 3D environment to gather the information the question asks about. Other attempts used interactive question answering (?) and grounded dialog (?). Another set of approaches (?) aims to map instructions to actions in 3D environments with visual goal prediction. Recently, ? (?) also built an interactive learning framework that endows the agent with the ability to ask for users’ help in ambiguous situations.

9.1.2 Image-and-Language Navigation - Datasets

For the image-and-language navigation task, several datasets have been designed. In the following, we present the details of these datasets separately.

Room-to-Room (R2R).

The R2R (?) dataset (https://bringmeaspoon.org) consists of real images of previously unseen building-scale 3D environments. Table 137 presents the splits of the dataset.

Split Scenes Instructions
Training 61 14,025
Validation (seen) 11 1,020
Validation (unseen) 11 2,349
Test 18 4,173
Table 137: Splits of the R2R dataset.
ASKNAV.

Similar to R2R, the ASKNAV (?) dataset (https://github.com/debadeepta/vnla) is built on top of Matterport3D (https://niessner.github.io/Matterport). However, the objective differs in that the agent queries an advisor when confused and makes progress accordingly. Matterport3D contains 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. A data point in the dataset consists of a single starting viewpoint but multiple goal viewpoints. Table 138 presents the splits of the dataset.

Split Data points Goals
Training 94,798 139,757
Validation (seen) 4,874 7,768
Validation (unseen) 5,005 8,245
Test (seen) 4,917 7,470
Test (unseen) 5,001 7,537
Table 138: Splits of the ASKNAV dataset.
TOUCHDOWN.

Extending beyond building environments, the TOUCHDOWN (?) dataset (https://github.com/lil-lab/touchdown) is designed for tasks such as executing navigation instructions (Navigation Only) and resolving spatial descriptions (SDR) in real-world environments. SDR is similar to the task of image referring expression (Section 4.1). The environment includes 29,641 panoramas (360° Google Street View RGB images) and 61,319 edges from New York City. Table 139 has more details about the dataset, while Table 140 presents its splits.

Dataset Dataset Size Vocab. Size Mean Text Length
TOUCHDOWN (Complete task) 9,326 5,625 108.0
Navigation Only 9,326 4,999 89.6
SDR Only 25,575 3,419 29.7
Table 139: Statistics of the TOUCHDOWN dataset. Vocabulary Size and Text Length are computed by combining the training and validation sets.
Task Split Examples
Complete & Navigation Only Training 6,526
Complete & Navigation Only Validation 1,391
Complete & Navigation Only Test 1,409
SDR Only Training 17,880
SDR Only Validation 3,836
SDR Only Test 3,859
Table 140: Splits of the TOUCHDOWN dataset.
Cooperative Vision-and-Dialog Navigation (CVDN).

CVDN (?) (https://cvdn.dev/) is a dataset (https://github.com/mmurray/cvdn/tree/master/tasks/CVDN/data) of embodied, human-human dialogs situated in a simulated, photorealistic home environment. Table 141 presents some statistics about the dataset.

Navigation Dialogs (Human-Human) Navigation Trajectories Total Scenes (MatterPort houses)
2,050 7,000 83
Table 141: Statistics of the CVDN dataset.
Action Learning From Realistic Environments and Directives (ALFRED).

ALFRED (?) (https://askforalfred.com/) is a benchmark and interactive visual dataset for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. Table 142 presents the splits of the dataset.

Data Split Fold Number of Scenes Number of Annotations
Training - 108 21,023
Validation Seen 88 820
Validation Unseen 4 821
Testing Seen 107 1,533
Testing Unseen 8 1,529
Table 142: Splits of the ALFRED dataset.

9.1.3 Image-and-Language Navigation - Evaluation Measures, Models, and Results

In this section, we present the evaluation measures, models, and results achieved with various architectures of Image-and-Language Navigation.

Evaluation Measures.

The measures that are designed explicitly for the Image-and-Language Navigation system (e.g., R2R) are:

  • Path Length (PL): PL is the total length of the executed trajectory.

  • Navigation Error (NE): NE is based on the shortest path distance in the navigation graph, and is calculated by measuring the average distance between the end-location predicted by the follower agent and the true route’s end-location.

  • Success Rate (SR): SR is the percentage of predicted end-locations within 3 meters of the true location.

  • Oracle Success Rate (OSR): OSR measures the success rate at the closest point to the goal that the agent has visited along the trajectory.

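  • Success weighted by Path Length (SPL): SPL is a trade-off between SR and PL; each successful episode is weighted by the ratio of the shortest-path length to the length of the path actually taken.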
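The measures above can be sketched as follows. This is a minimal illustration under stated assumptions, not the official R2R evaluation code: the episode layout (`path`, `goal`, `shortest`) is hypothetical, every `shortest` value is assumed to be positive, and Euclidean distance stands in by default for the shortest-path distance in the navigation graph, which a caller can supply through the `dist` argument.

```python
import math

def evaluate(episodes, dist=math.dist, threshold=3.0):
    """episodes: list of dicts with hypothetical keys
    'path' (executed trajectory as a list of points),
    'goal' (target point), and
    'shortest' (shortest-path length from start to goal, assumed > 0)."""
    totals = {"PL": 0.0, "NE": 0.0, "SR": 0.0, "OSR": 0.0, "SPL": 0.0}
    for ep in episodes:
        path, goal, shortest = ep["path"], ep["goal"], ep["shortest"]
        # PL: total length of the executed path.
        pl = sum(dist(a, b) for a, b in zip(path, path[1:]))
        # NE: distance between the predicted end-location and the goal.
        ne = dist(path[-1], goal)
        # SR: the episode succeeds if the agent stops within the threshold.
        success = 1.0 if ne < threshold else 0.0
        # OSR: success measured at the closest visited point to the goal.
        oracle = 1.0 if min(dist(p, goal) for p in path) < threshold else 0.0
        totals["PL"] += pl
        totals["NE"] += ne
        totals["SR"] += success
        totals["OSR"] += oracle
        # SPL: success weighted by shortest-path length over executed length.
        totals["SPL"] += success * shortest / max(pl, shortest)
    return {name: value / len(episodes) for name, value in totals.items()}
```

The SPL term explains why the beam-search rows in the result tables combine a high SR with an SPL close to zero: a path that is hundreds of meters long shrinks the weight of each success accordingly.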

Models.

Many models have been created to approach the task of Image-and-Language Navigation. In Table 143, we present some exemplar architectures (refer to Combined column) which integrate both image and language to address the task. We also include a column that showcases the optimization techniques used to train those models.

Approach Image Language Combined Optimizer RL
 (?) ResNet-152 LSTM Seq-to-Seq ADAM
 (?) ResNet-152 LSTM RPA -
 (?) ResNet-152 LSTM Speaker-Follower -
 (?) ResNet-152 LSTM RCM ADAM
 (?) ResNet-152 LSTM Self-Monitoring ADAM
 (?) ResNet-152 LSTM BackTranslation RMSprop
 (?) - LSTM FAST -
Table 143: Exemplar Image-and-Language Navigation architectures.
Results.

As discussed earlier, several models have been created to approach the task of Image-and-Language Navigation. Furthermore, many datasets have been created to provide variety in the content so that they improve the generalization ability of the models. In this section, we cover the results obtained by the models on a representative dataset for this task. Table 144, Table 145, and Table 146 present results obtained with a subset of models built using the R2R dataset presented in Section 9.1.2.

Model PL NE OSR SR SPL
Random 9.89 9.79 18.3 13.2 12
Seq-to-Seq (?) 8.13 7.85 26.6 20.4 18
RPA (?) 9.15 7.53 32.5 25.3 23
Speaker-Follower (?) 14.82 6.62 44.0 35.0 28
Self-Monitoring (?) 18.0 - - 48.0 35
RCM (?) 15.22 6.01 50.8 43.1 35
BackTranslation-Single (?) 11.7 - - 51.5 47
TacticalRewind-Greedy (?) 22.08 5.14 - 54 41
BackTranslation-PreExplore (?) 9.79 - - 63.9 61
BackTranslation-Beam (?) 687 - - 68.9 1
FAST-Beam (?) 196.53 4.29 - 61.0 3
Table 144: Comparison of different methods on the R2R test set.
Model PL NE OSR SR SPL
Speaker-Follower (?) - 3.36 73.8 66.4 -
RCM+SIL (?) 10.13 2.78 79.7 73.0 -
BackTranslation-Single (?) 11.0 3.99 - 62.1 59
TacticalRewind-Greedy (?) - - - - -
BackTranslation-PreExplore (?) 9.92 4.84 - 54.7 52
BackTranslation-Beam (?) 703 2.52 - 75.7 1
FAST-Beam (?) 188.6 3.13 - 70.0 4
Table 145: Comparison of different methods on the seen validation set of R2R.
Model PL NE OSR SR SPL
Speaker-Follower (?) - 3.36 73.8 66.4 -
RCM+SIL (?) 10.13 2.78 79.7 73.0 -
BackTranslation-Single (?) 10.7 5.22 - 52.2 48
TacticalRewind-Greedy (?) 21.17 4.97 - 56.0 43
BackTranslation-PreExplore (?) 9.57 3.78 - 64.5 61
BackTranslation-Beam (?) 663 3.08 - 69.0 1
FAST-Beam (?) 224.42 4.03 - 63.0 2
Table 146: Comparison of different methods on the unseen validation set of R2R.

9.1.4 Image-and-Language Navigation - Discussion

Image-and-Language Navigation is evaluated with different splits of the R2R validation and test datasets. From Table 144, Table 145, and Table 146, we can observe that Frontier Aware Search with backTracking (FAST)-Beam (?) achieves the lowest navigation error on the test set, although, like the other beam-search variant, it does so with very long paths and therefore a low SPL. This approach balances local and global signals while exploring an unobserved environment: it acts greedily but uses global signals to backtrack whenever necessary.

10 Vision-and-Language Pretraining

Inspired by work on pretraining only on vision (?) or solely on language data (?, ?, ?), vision-and-language pretraining seeks to jointly learn representations from both visual and textual content for improving the efficiency of the previously discussed vision and language integration tasks. Several methods for vision-and-language pretraining are discussed below; their architectures can be broadly divided into Single-stream and Two-stream. In the following, we provide more details on both types of architectures.

Single-stream Architectures.

These neural architectures are based on BERT-like (?) models that incorporate an Image Embedder, a Text Embedder, and a multi-layer Transformer (?). The proposed models are pretrained on data which in general have parallel multimodal components, i.e., videos or images along with captions. Further, the models are optimized with a combination of different objectives such as visual-based and text-based Masked Language Models (MLM), masked visual-feature modeling, and visual-linguistic matching. Learned representations are then used for different downstream tasks such as multimodal understanding or generation. For example, the VideoBERT (?) architecture has been designed to learn vision-language representations for a generative downstream task like video description generation (Section 3.1.4). Several other approaches, such as Bounding Boxes in Text Transformer (B2T2) (?), Unicoder-VL (?), VL-BERT (?), and UNITER (?), are designed for multimodal understanding and facilitate downstream tasks. Works such as VLP (?) and OSCAR (?) built unified models that can jointly understand and generate from cross-modal data. There is also interest in probing vision-and-language pretrained models (?) to comprehend the contribution from each modality and to help in designing better model architectures and objectives.
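To make the single-stream input format concrete, the following sketch builds one joint sequence from text tokens and image-region identifiers and applies text-side masking for an MLM objective. It is a simplified illustration, not the procedure of any specific model: the function name, the ordering of text before image regions, the special tokens, and the default masking probability of 0.15 (borrowed from BERT) are all assumptions, and the embedders and the Transformer itself are omitted.

```python
import random

MASK = "[MASK]"

def build_single_stream_input(text_tokens, region_ids, mask_prob=0.15, rng=None):
    """Return a joint text+image sequence and the MLM targets aligned with it."""
    rng = rng or random.Random(0)
    masked_text, labels = [], []
    for tok in text_tokens:
        if rng.random() < mask_prob:
            masked_text.append(MASK)
            labels.append(tok)    # original token becomes the prediction target
        else:
            masked_text.append(tok)
            labels.append(None)   # position is not scored
    # Both modalities share one sequence that a single Transformer would process.
    sequence = ["[CLS]"] + masked_text + ["[SEP]"] + list(region_ids)
    # Special tokens and image regions carry no text-MLM target in this sketch.
    seq_labels = [None] + labels + [None] + [None] * len(region_ids)
    return sequence, seq_labels
```

Because masked words must be recovered from a sequence that also contains the image regions, the model is pushed to use visual context when the text alone is ambiguous.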

Two-stream Architectures.

In contrast to single-stream, two-stream architectures adopt two independent encoders for learning visual and text representations. ViLBERT (?) and LXMERT (?) are examples of two-stream architectures which use self-attention principles to jointly learn representations from visual and textual data. ViLBERT builds a co-attentional transformer layer, while LXMERT uses a cross-modality encoder. Similar to single-stream, the two-stream architectures also optimize their models with pretraining tasks, such as MLM and vision-text matching. Sometimes they use additional text-only corpora for achieving better generalization on long and complex sentences.
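The exchange of information between the two streams can be sketched with plain scaled dot-product attention in which each modality supplies the queries and the other modality supplies the keys and values. This is only a sketch of the general idea: the function names are hypothetical, and the learned projections, multiple heads, residual connections, and feed-forward layers of the actual models are left out.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality
    and keys/values from the other."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def co_attention(text_feats, image_feats):
    """Each stream attends over the features of the other stream."""
    text_out = cross_attention(text_feats, image_feats, image_feats)
    image_out = cross_attention(image_feats, text_feats, text_feats)
    return text_out, image_out
```

The output for each text position is a mixture of image features and vice versa, while the two encoders keep separate parameters, which is the main contrast with the shared sequence of the single-stream design.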

In Table 147, we summarize both Single-stream and Two-stream architectures by presenting the vision and language integration tasks they support. It has to be noted that these architectures only use subsets of the datasets from each task. Also, the types of tasks they select are limited and mostly discriminative. Broadly, we denote with (✓) or (✗) whether they support the task in question or not.

Approach VDG VS VRE VQA VR VE VDiag MMT LVG VLN
Single-stream
Unicoder-VL
VL-BERT
VideoBERT
VLP
OSCAR
B2T2
UNITER
Two-stream
ViLBERT
LXMERT
Table 147: Major Vision-and-Language Pretraining Architectures and their support of Integration of Vision and Language Tasks. VDG - Visual Description Generation, VS - Visual Storytelling, VRE - Visual Referring Expression, VQA - Visual Question Answering, VR - Visual Reasoning, VE - Visual Entailment, VDiag - Visual Dialog, MMT - Multimodal Machine Translation, LVG - Language-to-Vision Generation, VLN - Vision-and-Language Navigation

11 Future Directions

The integration of vision and language has come a long way since the pioneering works, particularly after the adoption of deep learning techniques. Although the performance of current state-of-the-art models still needs to catch up with human abilities, the gap is diminishing at a steady rate. However, there is still ample room for theoretical and algorithmic improvements. Here, we enumerate several possible future directions that have the potential to advance the research overall.

Learning Common Sense and World Knowledge.

There is abundant out-of-domain data available that is not paired with vision and language task-specific corpora. Leveraging such information as factual, hierarchical, or commonsense knowledge can significantly improve the intelligence of vision and language systems. Prior work with pretrained language models has been shown to assist standalone NLP tasks such as commonsense reasoning (?) and fact prediction (?). Such knowledge has also shown promise for image caption generation (?, ?) and question answering (?, ?). Extending such ideas to other tasks would be an interesting research direction to pursue. Another possibility could be to utilize images, videos, and text in a synchronous and synergistic manner, as they implicitly encode different aspects of the world. Here, an open question would be how to extract world and commonsense knowledge from these sources.

Addressing Large-scale Data Limitations.

Most approaches designed for tasks that integrate vision and language use large datasets for training. With this trend, it will soon become harder to design new tasks without having a large dataset. To avoid this problem, future methods will need to be adaptable to datasets of different sizes. Therefore, trade-off approaches are required that establish how much data is enough to master a certain task. This requires designing methods that leverage neuro-symbolic reasoning systems (?, ?), which can decide the required amount of data.

Combining Multiple Tasks.

Some tasks can share ideas or representations with one another. For example, visual referring expression comprehension can be viewed as a visual dialog task (?) where a sequence of questions is used to refer to an object in the image. Similarly, image caption generation can be viewed as a visual referring expression generation task (?).

Novel Neural Architectures for Representation.

Up until late 2017, the de facto standards for learning language and vision representations were RNNs and CNNs, respectively. However, over the last few years, novel ideas have been introduced that address the limitations of these network types either theoretically or computationally, and there is growing interest in adopting them. For instance, the Transformer (?) architecture that is used extensively for pure NLP tasks may see adoption for tasks that integrate vision and language. It has already shown its applicability to image caption generation (?). In a similar manner, graph neural networks (?, ?, ?), which were introduced to tackle graph-structured data, have already shown their promise in visual reasoning (?). Exploiting the compositionality of visual objects to describe an entire visual scene with neural modular networks is also an interesting direction to explore for many vision and language tasks.

Image vs Video.

Most of the research into integrating vision and language concentrates on static images. This trend is clearly visible from the array of datasets and methods available for image and language integration tasks. Nevertheless, although video is a more complex setting, it deserves similar attention, as datasets for it are scarce. For instance, there is only one dataset available for each of the tasks Video Dialog (Section 6.2), Video Referring Expression (Section 4.2), Language-to-Video Generation (Section 8.2), and Machine Translation with Videos (Section 7.2), while tasks such as Vision-and-Language Navigation (Section 9) completely lack video-based datasets.

3D-Vision and Language.

The world that we inhabit is inherently 3D. From this perspective, restricting vision and language research to just 2D, viz. images and videos, might hinder real-world agents, e.g., humanoid robots, from fully understanding the complexities of the 3D world and navigating it with ease. To avoid such pitfalls, algorithms and techniques need to be developed for processing 3D inputs such as RGB-D and point clouds in conjunction with language. Some pioneering works have already begun in this direction (?, ?), and we anticipate that the trend will shift further towards developing algorithms for understanding as well as generating 3D scenes, while utilizing language as a main or auxiliary modality.

Automatic Evaluation Measures.

Automatic evaluation measures exist for several vision and language tasks. However, most of them are adaptations from standalone NLP tasks such as machine translation. For example, the BLEU and METEOR metrics used for evaluating visual caption generation and storytelling models have been found not to correlate well with human judgements (?). The SPICE metric, designed specifically for visual caption generation, depends on parsing and is therefore not adaptable to other tasks such as storytelling. Such shortcomings point to a promising research direction: developing evaluation measures that are applicable to several tasks. Similarly, language-to-vision generation, although it has quantitative measures, typically depends on human evaluation and needs to adopt novel techniques for effective quantitative evaluation. Other tasks such as vision-and-language navigation and visual reasoning have specific evaluation measures, which can be improved further.
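
One way to see why n-gram overlap can diverge from human judgement is to compute clipped n-gram precision, the core quantity inside BLEU, on a small example. The sketch below is a simplified illustration: it omits BLEU's brevity penalty and the geometric mean over n-gram orders, the function names are hypothetical, and the captions used in the usage note are invented rather than taken from any dataset.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate against reference captions.

    Each candidate n-gram count is clipped to the maximum number of times
    that n-gram occurs in any single reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(
        min(count, max_ref[gram]) for gram, count in cand_counts.items()
    )
    return clipped / sum(cand_counts.values())
```

With the single reference "a man is riding a horse", the paraphrase "a person rides a pony" obtains a unigram precision of 2/5, whereas the factually wrong caption "a man is riding a car" obtains 5/6. A human would likely prefer the paraphrase, which illustrates how surface overlap can reward the wrong caption.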

12 Conclusion

In this survey, we elaborated on recent trends in the integration of vision and language research. Initially, we provided background on the varied tasks in CV and NLP, and further identified ten prominent tasks that integrate vision and language. In addition, we described how each integration task expands on the standalone CV or NLP tasks on which it is based. Following that, we reviewed and analyzed each task separately by presenting a comprehensive introduction to how the tasks are designed in a bottom-up manner. Additionally, we presented different methods used to address the tasks, along with exemplar architectures designed to integrate vision and language representations. We also provided a brief review of the datasets, evaluation measures, and the relative performance obtained by state-of-the-art methods. Finally, in a separate section, we explored the various ways to pretrain with large-scale multimodal data for supporting downstream vision and language integration tasks with minimal fine-tuning effort. Moreover, we outlined the extent to which existing pretraining approaches support the ten prominent integration tasks described earlier. Compared with the standalone research done individually in the fields of CV and NLP, the synergy of both, fuelled by advanced machine learning techniques, is expected to yield more intelligent and sustainable systems. Making them easily accessible can therefore have direct commercial and societal impact. However, despite the significant progress achieved in many integration tasks, large-scale evaluation of those systems shows that they still fall behind human performance. This confirms that there is still a good deal of room for improvement. In particular, designing novel evaluation measures and architectures that can adequately deal with the complexity of vision and language integration problems has the potential to address the challenges. Hence, we concluded the survey with a few possible future research directions. We believe that our survey will help to systematize future research papers and to investigate the unresolved problems that are hindering progress in the integration of vision and language research.

Acknowledgments

This work was supported by the German Research Foundation (DFG) as part of Project-ID 232722074 - SFB1102. We extend our special thanks to Matthew Kuhn and Stephanie Lund for painstakingly proofing the whole manuscript. We also acknowledge the insightful comments of Marius Mosbach on the first version of the draft.

Nomenclature

  • AI: Artificial Intelligence
  • AMT: Amazon Mechanical Turk
  • ANetCap: ActivityNet Captions
  • AREL: Adversarial REward Learning
  • AVSD: Audio Visual Scene-Aware Dialog
  • BDD: Berkeley Deep Drive
  • BDD-X: Berkeley Deep Drive eXplanation
  • BiLSTM: Bidirectional LSTM
  • BLEU: BiLingual Evaluation Understudy
  • BRNN: Bidirectional Recurrent Neural Network
  • CIDEr: Consensus based Image Description Evaluation
  • CLEVR: Compositional Language and Elementary Visual Reasoning
  • CLEVR-CoGenT: CLEVR Compositional Generalization Test
  • CLID: Cross-Lingual Image Description
  • CMM: Cascaded Mutual Modulation
  • CMRE: Cross-Modal Relationship Extractor
  • CNN: Convolutional Neural Networks
  • COG: Configurable Visual Question and Answer
  • CoAtt-GAN: Co-Attention GAN
  • CUB: Caltech-UCSD Birds
  • DII: Descriptions of Images-in-Isolation
  • DIS: Descriptions of Images-in-Sequence
  • EVE: Explainable Visual Entailment
  • FAST: Frontier Aware Search with backTracking
  • FiLM: Feature-wise Linear Modulation
  • GANs: Generative Adversarial Networks
  • GGCN: Gated Graph Convolutional Network
  • GNN: Graph Neural Network
  • GRU: Gated Recurrent Units
  • GVF: Global Visual Features
  • GQA: General Question Answering
  • KVQA: Knowledge-aware Visual Question Answering
  • LBA: Learning-By-Asking
  • LGCN: Language-Conditioned Graph Networks
  • LSTM: Long Short-Term Memory
  • MAC: Memory, Attention, and Composition
  • MedRank: Median Rank
  • METEOR: Metric for Evaluation of Translation with Explicit Ordering
  • MIL: Multiple Instance Learning
  • MPII: Max Planck Institute for Informatics
  • MPII-MD: MPII Movie Description
  • MRR: Mean Reciprocal Rank
  • MSCOCO: Microsoft Common Objects in COntext
  • MSR-VTT: Microsoft Research Video to Text
  • MSVD: Microsoft Video Description
  • MuRel: Multimodal Relational network
  • M-VAD: Montreal Video Annotation
  • NDCG: Normalized Discounted Cumulative Gain
  • NIC: Neural Image Captioning
  • NS-CL: Neuro-Symbolic Concept Learner
  • NYC-Storytelling: New York City Storytelling
  • OK-VQA: Outside Knowledge Visual Question Answering
  • R2R: Room-2-Room
  • RAVEN: Relational and Analogical Visual rEasoNing
  • R-CNN: Region-based CNN
  • RE: Referring Expression
  • RGB-D: Red, Green, Blue, Depth
  • RNs: Relation Networks
  • ROUGE: Recall Oriented Understudy for Gisting Evaluation
  • RvA: Recursive Visual Attention
  • SCRC: Spatial Context Recurrent Convnet
  • SGD: Stochastic Gradient Descent
  • SIS: Stories for Images-in-Sequence
  • SIND: Sequential Image Narrative Dataset
  • SPICE: Semantic Propositional Image Captioning Evaluation
  • TACoS: Textually Annotated Cooking Scenes
  • VATEX: Video And TEXt
  • VIST: Visual Storytelling
  • V-SNLI: Visually-grounded Natural Language Inference
  • VTW: Videos Titles in the Wild

References

  • Aafaq et al. Aafaq, N., Mian, A., Liu, W., Gilani, S. Z., and Shah, M. (2020). Video description: A survey of methods, datasets, and evaluation metrics.  ACM Comput. Surv., 52(6), 115:1–115:37.
  • Achlioptas et al. Achlioptas, P., Abdelreheem, A., Xia, F., Elhoseiny, M., and Guibas, L. (2020). Referit3d: Neural listeners for fine-grained 3D object identification in real-world scenes.  In 16th European Conference on Computer Vision (ECCV), August 23-28, 2020. Springer.
  • Aditya et al. Aditya, S., Saha, R., Yang, Y., and Baral, C. (2019). Spatial knowledge distillation to aid visual reasoning.  In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 227–235. IEEE.
  • Agrawal et al. Agrawal, A., Batra, D., Parikh, D., and Kembhavi, A. (2018). Don’t just assume; look and answer: Overcoming priors for visual question answering.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4971–4980. IEEE Computer Society.
  • Agrawal et al. Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C. L., Parikh, D., and Batra, D. (2017). Vqa: Visual question answering.  International Journal of Computer Vision, 123(1), 4–31.
  • Agrawal et al. Agrawal, H., Anderson, P., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., and Lee, S. (2019). nocaps: novel object captioning at scale.  In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 8947–8956. IEEE.
  • Agrawal et al. Agrawal, H., Chandrasekaran, A., Batra, D., Parikh, D., and Bansal, M. (2016). Sort story: Sorting jumbled images and captions into stories.  In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 925–931.
  • AlAmri et al. AlAmri, H., Cartillier, V., Das, A., Wang, J., Cherian, A., Essa, I., Batra, D., Marks, T. K., Hori, C., Anderson, P., Lee, S., and Parikh, D. (2019a). Audio visual scene-aware dialog.  In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 7558–7567. Computer Vision Foundation / IEEE.
  • Alamri et al. Alamri, H., Hori, C., Marks, T. K., Batra, D., and Parikh, D. (2019b). Audio visual scene-aware dialog (AVSD) track for natural language generation in DSTC7.  In DSTC7 workshop at AAAI.
  • Alberti et al. Alberti, C., Ling, J., Collins, M., and Reitter, D. (2019). Fusion of detected objects in text for visual question answering.  In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2131–2140.
  • Anderson et al. Anderson, P., Fernando, B., Johnson, M., and Gould, S. (2016). Spice: Semantic propositional image caption evaluation.  In European Conference on Computer Vision, pp. 382–398. Springer.
  • Anderson et al. Anderson, P., Fernando, B., Johnson, M., and Gould, S. (2017). Guided open vocabulary image captioning with constrained beam search.  In EMNLP.
  • Anderson et al. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018a). Bottom-up and top-down attention for image captioning and visual question answering.  In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6077–6086.
  • Anderson et al. Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., and van den Hengel, A. (2018b). Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3683.
  • Andreas et al. Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2016a). Learning to compose neural networks for question answering.  In Knight, K., Nenkova, A., and Rambow, O. (Eds.), NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pp. 1545–1554. The Association for Computational Linguistics.
  • Andreas et al. Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2016b). Neural module networks.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39–48. IEEE Computer Society.
  • Aneja et al. Aneja, J., Deshpande, A., and Schwing, A. G. (2018). Convolutional image captioning.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5561–5570.
  • Antol et al. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., and Parikh, D. (2015). Vqa: Visual question answering.  In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433.
  • Bahdanau et al. Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate.  In Bengio, Y., and LeCun, Y. (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Bai and An Bai, S., and An, S. (2018). A survey on automatic image caption generation.  Neurocomputing, 311, 291–304.
  • Baltrušaitis et al. Baltrušaitis, T., Ahuja, C., and Morency, L.-P. (2019). Multimodal machine learning: A survey and taxonomy.  IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443.
  • Banerjee and Lavie Banerjee, S., and Lavie, A. (2005). METEOR: An automatic metric for mt evaluation with improved correlation with human judgments.  In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72.
  • Barrault et al. Barrault, L., Bougares, F., Specia, L., Lala, C., Elliott, D., and Frank, S. (2018). Findings of the third shared task on multimodal machine translation.  In Bojar, O., Chatterjee, R., Federmann, C., Fishel, M., Graham, Y., Haddow, B., Huck, M., Jimeno-Yepes, A., Koehn, P., Monz, C., Negri, M., Névéol, A., Neves, M. L., Post, M., Specia, L., Turchi, M., and Verspoor, K. (Eds.), Proceedings of the Third Conference on Machine Translation: Shared Task Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pp. 304–323. Association for Computational Linguistics.
  • Battaglia et al. Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. (2018). Relational inductive biases, deep learning, and graph networks.  CoRR, abs/1806.01261.
  • Baumann et al. Baumann, A., Boltz, M., Ebling, J., Koenig, M., Loos, H., Merkel, M., Niem, W., Warzelhan, J., and Yu, J. (2008). A review and comparison of measures for automatic video surveillance systems.  EURASIP Journal on Image and Video Processing, 2008(1), 824726.
  • Bernardi et al. Bernardi, R., Cakici, R., Elliott, D., Erdem, A., Erdem, E., Ikizler-Cinbis, N., Keller, F., Muscat, A., and Plank, B. (2016). Automatic description generation from images: A survey of models, datasets, and evaluation measures.  Journal of Artificial Intelligence Research (JAIR), 55, 409–442.
  • Blösch et al. Blösch, M., Weiss, S., Scaramuzza, D., and Siegwart, R. (2010). Vision based mav navigation in unknown and unstructured environments.  In 2010 IEEE International Conference on Robotics and Automation, pp. 21–28. IEEE.
  • Bottou Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent.  In Proceedings of COMPSTAT’2010, pp. 177–186. Springer.
  • Bowman et al. Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. (2015). A large annotated corpus for learning natural language inference.  In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., and Marton, Y. (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp. 632–642. The Association for Computational Linguistics.
  • Brown et al. Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J. D., Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation.  Computational linguistics, 16(2).
  • Brown et al. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners.  arXiv preprint arXiv:2005.14165.
  • Burke Burke, H. R. (1958). Raven’s progressive matrices: A review and critical evaluation.  The Journal of Genetic Psychology, 93(2), 199–228.
  • Cadène et al. Cadène, R., Ben-younes, H., Cord, M., and Thome, N. (2019). MUREL: multimodal relational reasoning for visual question answering.  In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 1989–1998. Computer Vision Foundation / IEEE.
  • Caglayan et al. Caglayan, O., Aransa, W., Bardet, A., García-Martínez, M., Bougares, F., Barrault, L., Masana, M., Herranz, L., and van de Weijer, J. (2017). LIUM-CVC submissions for WMT17 multimodal translation task.  In Bojar, O., Buck, C., Chatterjee, R., Federmann, C., Graham, Y., Haddow, B., Huck, M., Jimeno-Yepes, A., Koehn, P., and Kreutzer, J. (Eds.), Proceedings of the Second Conference on Machine Translation, WMT 2017, Copenhagen, Denmark, September 7-8, 2017, pp. 432–439. Association for Computational Linguistics.
  • Caglayan et al. Caglayan, O., Madhyastha, P., Specia, L., and Barrault, L. (2019). Probing the need for visual context in multimodal machine translation.  In Burstein, J., Doran, C., and Solorio, T. (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4159–4170. Association for Computational Linguistics.
  • Calixto and Liu Calixto, I., and Liu, Q. (2017). Incorporating global visual features into attention-based neural machine translation.  In Palmer, M., Hwa, R., and Riedel, S. (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 992–1003. Association for Computational Linguistics.
  • Calixto et al. Calixto, I., Liu, Q., and Campbell, N. (2017). Doubly-attentive decoder for multi-modal neural machine translation.  In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1913–1924.
  • Calixto et al. Calixto, I., Rios, M., and Aziz, W. (2018). Latent visual cues for neural machine translation.  CoRR, abs/1811.00357.
  • Cao et al. Cao, J., Gan, Z., Cheng, Y., Yu, L., Chen, Y.-C., and Liu, J. (2020). Behind the scene: Revealing the secrets of pre-trained vision-and-language models.  In 16th European Conference on Computer Vision (ECCV), August 23-28, 2020. Springer.
  • Cao et al. Cao, Q., Liang, X., Li, B., Li, G., and Lin, L. (2018). Visual question reasoning on general dependency tree.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7249–7257.
  • Cao et al. Cao, Y., Long, M., Wang, J., Yang, Q., and Yu, P. S. (2016). Deep visual-semantic hashing for cross-modal retrieval.  In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1445–1454. ACM.
  • Carion et al. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-end object detection with transformers.  In 16th European Conference on Computer Vision (ECCV), August 23-28, 2020. Springer.
  • Carreira and Zisserman Carreira, J., and Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308.
  • Chang et al. Chang, S., Yang, J., Park, S., and Kwak, N. (2018). Broadcasting convolutional network for visual relational reasoning.  In Proceedings of the European Conference on Computer Vision (ECCV), pp. 754–769.
  • Chen et al. Chen, D. Z., Chang, A. X., and Nießner, M. (2020). Scanrefer: 3d object localization in RGB-D scans using natural language.  In 16th European Conference on Computer Vision (ECCV), August 23-28, 2020. Springer.
  • Chen and Dolan Chen, D. L., and Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation.  In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 190–200. Association for Computational Linguistics.
  • Chen et al. Chen, H., Suhr, A., Misra, D., Snavely, N., and Artzi, Y. (2019). Touchdown: Natural language navigation and spatial reasoning in visual street environments.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12538–12547.
  • Chen et al. Chen, T., Liao, Y., Chuang, C., Hsu, W. T., Fu, J., and Sun, M. (2017). Show, adapt and tell: Adversarial training of cross-domain image captioner.  In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 521–530. IEEE Computer Society.
  • Chen and Lawrence Zitnick Chen, X., and Lawrence Zitnick, C. (2015). Mind’s eye: A recurrent visual representation for image caption generation.  In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2422–2431.
  • Chen et al. Chen, Y.-C., Li, L., Yu, L., Kholy, A. E., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. (2020). Uniter: Universal image-text representation learning.  In 16th European Conference on Computer Vision (ECCV), August 23-28, 2020. Springer.
  • Cheng et al. Cheng, Y., Gan, Z., Li, Y., Liu, J., and Gao, J. (2018). Sequential attention GAN for interactive image editing via dialogue.  CoRR, abs/1812.08352.
  • Chi et al. Chi, T., Shen, M., Eric, M., Kim, S., and Hakkani-Tür, D. (2020). Just Ask: An interactive learning framework for vision and language navigation.  In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 2459–2466. AAAI Press.
  • Cho et al. Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation.  In Moschitti, A., Pang, B., and Daelemans, W. (Eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1724–1734. ACL.
  • Chrupała et al. Chrupała, G., Gelderloos, L., and Alishahi, A. (2017). Representations of language in a model of visually grounded speech signal.  In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 613–622.
  • Chung et al. Chung, J., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling.  CoRR, abs/1412.3555.
  • Cirik et al. Cirik, V., Berg-Kirkpatrick, T., and Morency, L.-P. (2018a). Using syntax to ground referring expressions in natural images.  In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Cirik et al. Cirik, V., Morency, L., and Berg-Kirkpatrick, T. (2018b). Visual referring expression recognition: What do systems actually learn?.  In Walker, M. A., Ji, H., and Stent, A. (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pp. 781–787. Association for Computational Linguistics.
  • Condoravdi et al. Condoravdi, C., Crouch, D., De Paiva, V., Stolle, R., and Bobrow, D. G. (2003). Entailment, intensionality and text understanding.  In Proceedings of the HLT-NAACL 2003 workshop on Text meaning.
  • Conneau and Lample Conneau, A., and Lample, G. (2019). Cross-lingual language model pretraining.  In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Garnett, R. (Eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 7057–7067.
  • Cornia et al. Cornia, M., Baraldi, L., and Cucchiara, R. (2019). Show, control and tell: A framework for generating controllable and grounded captions.  In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 8307–8316. Computer Vision Foundation / IEEE.
  • Dai et al. Dai, B., Fidler, S., Urtasun, R., and Lin, D. (2017). Towards diverse and natural image descriptions via a conditional GAN.  In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2989–2998. IEEE Computer Society.
  • Dai and Lin Dai, B., and Lin, D. (2017). Contrastive learning for image captioning.  In Advances in Neural Information Processing Systems, pp. 898–907.
  • Das et al. Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., and Batra, D. (2018a). Embodied question answering.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2054–2063.
  • Das et al. Das, A., Gkioxari, G., Lee, S., Parikh, D., and Batra, D. (2018b). Neural modular control for embodied question answering.  In CoRL, Vol. 87 of Proceedings of Machine Learning Research, pp. 53–62. PMLR.
  • Das et al. Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M. F., Parikh, D., and Batra, D. (2017a). Visual dialog.  In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 1080–1089. IEEE Computer Society.
  • Das et al. Das, A., Kottur, S., Moura, J. M. F., Lee, S., and Batra, D. (2017b). Learning cooperative visual dialog agents with deep reinforcement learning.  In ICCV, pp. 2970–2979. IEEE Computer Society.
  • Das et al. Das, P., Xu, C., Doell, R. F., and Corso, J. J. (2013). A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2634–2641. IEEE Computer Society.
  • Dash et al. Dash, A., Gamboa, J. C. B., Ahmed, S., Liwicki, M., and Afzal, M. Z. (2017). TAC-GAN - text conditioned auxiliary classifier generative adversarial network.  CoRR, abs/1703.06412.
  • De Mulder et al. De Mulder, W., Bethard, S., and Moens, M.-F. (2015). A survey on the application of recurrent neural networks to statistical language modeling.  Computer Speech & Language, 30(1), 61–98.
  • de Vries et al. de Vries, H., Shuster, K., Batra, D., Parikh, D., Weston, J., and Kiela, D. (2018). Talk the walk: Navigating new york city through grounded dialogue.  CoRR, abs/1807.03367.
  • de Vries et al. de Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., and Courville, A. C. (2017). Guesswhat?! visual object discovery through multi-modal dialogue.  In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 4466–4475. IEEE Computer Society.
  • Delbrouck and Dupont Delbrouck, J.-B., and Dupont, S. (2017a). An empirical study on the effectiveness of images in multimodal neural machine translation.  In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 910–919.
  • Delbrouck and Dupont Delbrouck, J., and Dupont, S. (2017b). Multimodal compact bilinear pooling for multimodal neural machine translation.  CoRR, abs/1703.08084.
  • Deng et al. Deng, C., Wu, Q., Wu, Q., Hu, F., Lyu, F., and Tan, M. (2018). Visual grounding via accumulated attention.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7746–7755.
  • Deng et al. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database.  In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE.
  • Deshpande et al. Deshpande, A., Aneja, J., Wang, L., Schwing, A. G., and Forsyth, D. A. (2018). Diverse and controllable image captioning with part-of-speech guidance.  CoRR, abs/1805.12589.
  • Devlin et al. Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language understanding.  In Burstein, J., Doran, C., and Solorio, T. (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics.
  • Dodge et al. Dodge, J., Gane, A., Zhang, X., Bordes, A., Chopra, S., Miller, A. H., Szlam, A., and Weston, J. (2016). Evaluating prerequisite qualities for learning end-to-end dialog systems.  In Bengio, Y., and LeCun, Y. (Eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
  • Donahue et al. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description.  In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2625–2634.
  • El-Nouby et al. El-Nouby, A., Sharma, S., Schulz, H., Hjelm, D., El Asri, L., Ebrahimi Kahou, S., Bengio, Y., and Taylor, G. W. (2018). Keep drawing it: Iterative language-based image generation and editing.  In Neural Information Processing Systems (NeurIPS) Visually-Grounded Interaction and Language (ViGIL) Workshop.
  • Elliott Elliott, D. (2018). Adversarial evaluation of multimodal machine translation.  In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J. (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 2974–2978. Association for Computational Linguistics.
  • Elliott et al. Elliott, D., Frank, S., Barrault, L., Bougares, F., and Specia, L. (2017). Findings of the second shared task on multimodal machine translation and multilingual image description.  In Bojar, O., Buck, C., Chatterjee, R., Federmann, C., Graham, Y., Haddow, B., Huck, M., Jimeno-Yepes, A., Koehn, P., and Kreutzer, J. (Eds.), Proceedings of the Second Conference on Machine Translation, WMT 2017, Copenhagen, Denmark, September 7-8, 2017, pp. 215–233. Association for Computational Linguistics.
  • Elliott et al. Elliott, D., Frank, S., and Hasler, E. (2015). Multi-language image description with neural sequence models.  CoRR, abs/1510.04709.
  • Elliott et al. Elliott, D., Frank, S., Sima’an, K., and Specia, L. (2016). Multi30k: Multilingual english-german image descriptions.  In Proceedings of the 5th Workshop on Vision and Language, hosted by the 54th Annual Meeting of the Association for Computational Linguistics, VL@ACL 2016, August 12, Berlin, Germany. The Association for Computer Linguistics.
  • Elliott and Kádár Elliott, D., and Kádár, Á. (2017). Imagination improves multimodal translation.  In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 130–141.
  • Elliott and Keller Elliott, D., and Keller, F. (2013). Image description using visual dependency representations.  In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1292–1302.
  • Fang et al. Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J. C., et al. (2015). From captions to visual concepts and back.  In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1473–1482.
  • Farhadi et al. Farhadi, A., Hejrati, S. M. M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. A. (2010). Every picture tells a story: Generating sentences from images.  In ECCV (4), Vol. 6314 of Lecture Notes in Computer Science, pp. 15–29. Springer.
  • Ferraro et al. Ferraro, F., Mostafazadeh, N., Huang, T. K., Vanderwende, L., Devlin, J., Galley, M., and Mitchell, M. (2015). A survey of current datasets for vision and language research.  In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., and Marton, Y. (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp. 207–213. The Association for Computational Linguistics.
  • FitzGerald et al. FitzGerald, N., Artzi, Y., and Zettlemoyer, L. (2013). Learning distributions over logical forms for referring expression generation.  In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1914–1925.
  • Fried et al. Fried, D., Hu, R., Cirik, V., Rohrbach, A., Andreas, J., Morency, L.-P., Berg-Kirkpatrick, T., Saenko, K., Klein, D., and Darrell, T. (2018). Speaker-follower models for vision-and-language navigation.  In Advances in Neural Information Processing Systems, pp. 3318–3329.
  • Frome et al. Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al. (2013). Devise: A deep visual-semantic embedding model.  In Advances in neural information processing systems, pp. 2121–2129.
  • Fukui et al. Fukui, A., Park, D. H., Yang, D., Rohrbach, A., Darrell, T., and Rohrbach, M. (2016). Multimodal compact bilinear pooling for visual question answering and visual grounding.  In Su, J., Carreras, X., and Duh, K. (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 457–468. The Association for Computational Linguistics.
  • Gan et al. Gan, C., Gan, Z., He, X., Gao, J., and Deng, L. (2017). Stylenet: Generating attractive visual captions with styles.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3137–3146.
  • Gao et al. Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., and Xu, W. (2015). Are you talking to a machine? dataset and methods for multilingual image question.  In Advances in neural information processing systems, pp. 2296–2304.
  • Gao et al. Gao, L., Chen, D., Song, J., Xu, X., Zhang, D., and Shen, H. T. (2019). Perceptual pyramid adversarial networks for text-to-image synthesis.  In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 8312–8319. AAAI Press.
  • Gao et al. Gao, L., Guo, Z., Zhang, H., Xu, X., and Shen, H. T. (2017). Video captioning with attention-based lstm and semantic consistency.  IEEE Transactions on Multimedia, 19(9), 2045–2055.
  • Gatt and Krahmer Gatt, A., and Krahmer, E. (2018). Survey of the state of the art in natural language generation: Core tasks, applications and evaluation.  Journal of Artificial Intelligence Research, 61, 65–170.
  • Gatys et al. Gatys, L. A., Ecker, A. S., and Bethge, M. (2016). Image style transfer using convolutional neural networks.  In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2414–2423.
  • Gella and Keller Gella, S., and Keller, F. (2017). An analysis of action recognition datasets for language and vision tasks.  In Barzilay, R., and Kan, M. (Eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 2: Short Papers, pp. 64–71. Association for Computational Linguistics.
  • Gella et al. Gella, S., Lewis, M., and Rohrbach, M. (2018). A dataset for telling the stories of social media videos.  In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J. (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 968–974. Association for Computational Linguistics.
  • Geman et al. Geman, D., Geman, S., Hallonquist, N., and Younes, L. (2015). Visual turing test for computer vision systems.  Proceedings of the National Academy of Sciences, 112(12), 3618–3623.
  • Golland et al. Golland, D., Liang, P., and Klein, D. (2010). A game-theoretic approach to generating spatial descriptions.  In Proceedings of the 2010 conference on empirical methods in natural language processing, pp. 410–419. Association for Computational Linguistics.
  • Goodfellow et al. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets.  In Advances in neural information processing systems, pp. 2672–2680.
  • Gordon et al. Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., and Farhadi, A. (2018). Iqa: Visual question answering in interactive environments.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4089–4098.
  • Goyal et al. Goyal, Y., Khot, T., Agrawal, A., Summers-Stay, D., Batra, D., and Parikh, D. (2019). Making the v in vqa matter: Elevating the role of image understanding in visual question answering.  International Journal of Computer Vision, 127(4), 398–414.
  • Graham et al. Graham, Y., Awad, G., and Smeaton, A. (2018). Evaluation of automatic video captioning using direct assessment.  PloS one, 13(9), e0202789.
  • Grönroos et al. Grönroos, S., Huet, B., Kurimo, M., Laaksonen, J., Mérialdo, B., Pham, P., Sjöberg, M., Sulubacak, U., Tiedemann, J., Troncy, R., and Vázquez, R. (2018). The memad submission to the WMT18 multimodal translation task.  In Bojar, O., Chatterjee, R., Federmann, C., Fishel, M., Graham, Y., Haddow, B., Huck, M., Jimeno-Yepes, A., Koehn, P., Monz, C., Negri, M., Névéol, A., Neves, M. L., Post, M., Specia, L., Turchi, M., and Verspoor, K. (Eds.), Proceedings of the Third Conference on Machine Translation: Shared Task Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pp. 603–611. Association for Computational Linguistics.
  • Gu et al. Gu, J., Cai, J., Wang, G., and Chen, T. (2018). Stack-captioning: Coarse-to-fine learning for image captioning.  In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Gu et al. Gu, J., Wang, G., Cai, J., and Chen, T. (2017). An empirical study of language cnn for image captioning.  In Proceedings of the IEEE International Conference on Computer Vision, pp. 1222–1231.
  • Guo et al. Guo, D., Xu, C., and Tao, D. (2019). Image-question-answer synergistic network for visual dialog.  In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 10434–10443. Computer Vision Foundation / IEEE.
  • Guo et al. Guo, G., Zhai, S., Yuan, F., Liu, Y., and Wang, X. (2018). Vse-ens: Visual-semantic embeddings with efficient negative sampling.  In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Guo et al. Guo, L., Liu, J., Yao, P., Li, J., and Lu, H. (2019). Mscap: Multi-style image captioning with unpaired stylized text.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4204–4213.
  • Gupta et al. Gupta, T., Schwenk, D., Farhadi, A., Hoiem, D., and Kembhavi, A. (2018). Imagine this! scripts to compositions to videos.  In Proceedings of the European Conference on Computer Vision (ECCV), pp. 598–613.
  • Gururangan et al. Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S. R., and Smith, N. A. (2018). Annotation artifacts in natural language inference data.  In Walker, M. A., Ji, H., and Stent, A. (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pp. 107–112. Association for Computational Linguistics.
  • Harabagiu et al. Harabagiu, S. M., Pasca, M. A., and Maiorano, S. J. (2000). Experiments with open-domain textual question answering.  In COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics.
  • Haurilet et al. Haurilet, M., Roitberg, A., and Stiefelhagen, R. (2019). It is not about the journey; it is about the destination: Following soft paths under question-guidance for visual reasoning.  In Proceedings of the IEEE conference on computer vision and pattern recognition.
  • He et al. He, K., Gkioxari, G., Dollár, P., and Girshick, R. B. (2017). Mask R-CNN.  In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2980–2988. IEEE Computer Society.
  • He et al. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition.  In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
  • Helcl et al. Helcl, J., Libovický, J., and Varis, D. (2018). CUNI system for the WMT18 multimodal translation task.  In Bojar, O., Chatterjee, R., Federmann, C., Fishel, M., Graham, Y., Haddow, B., Huck, M., Jimeno-Yepes, A., Koehn, P., Monz, C., Negri, M., Névéol, A., Neves, M. L., Post, M., Specia, L., Turchi, M., and Verspoor, K. (Eds.), Proceedings of the Third Conference on Machine Translation: Shared Task Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pp. 616–623. Association for Computational Linguistics.
  • Hendricks et al. Hendricks, L. A., Venugopalan, S., Rohrbach, M., Mooney, R., Saenko, K., and Darrell, T. (2016). Deep Compositional Captioning: Describing novel object categories without paired training data.  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–10.