Vision-to-Language Tasks Based on Attributes and Attention Mechanism
Vision-to-Language tasks aim to integrate computer vision and natural language processing, and they have attracted the attention of many researchers. Typical approaches encode an image into feature representations and decode them into natural language sentences. However, they neglect high-level semantic concepts and the subtle relationships between image regions and natural language elements. To make full use of this information, this paper exploits text-guided attention and semantic-guided attention to find the more correlated spatial information and to reduce the semantic gap between vision and language. Our method includes two levels of attention networks. One is the text-guided attention network, which is used to select the text-related regions. The other is the semantic-guided attention network, which is used to highlight the concept-related regions and the region-related concepts. Finally, all of this information is incorporated to generate captions or answers. Image captioning and visual question answering experiments have been carried out, and the experimental results show the excellent performance of the proposed approach.
Vision-to-Language (V2L) tasks aim to integrate natural language processing and computer vision. Typical V2L tasks are image captioning [41, 21, 30, 23], visual question answering (VQA) [40, 33, 15] and video description [7, 3, 49, 19, 50]. Recently, with the advance of artificial intelligence (AI), V2L tasks have attracted extensive attention. In practice, V2L tasks enable many important applications, including early childhood education, human-robot interaction and assistance for visually impaired people.
Many recent approaches for V2L tasks have achieved encouraging results by combining Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for image encoding and text generation, respectively [16, 36, 26, 31]. Concretely, a CNN pre-trained on ImageNet is used to extract the global image feature, while an RNN is used to encode the language information. Most recent approaches belong to this “CNN-RNN” paradigm. Although they have attained promising results, some limitations must be overcome before further improvements can be made.
I-A Motivation and Overview
High-level semantic concepts of an image (also called image attributes, i.e., the objects, actions, scenes and object attributes in the image) are very important information for V2L tasks. In previous works, spatial attention based methods are the most popular scheme for V2L tasks (to distinguish it from the semantic-guided attention network, the spatial attention network is called the text-guided attention network in this paper). Namely, the relationships between natural language elements and image regions are modeled by computing attention weights between words/questions and image regions. The core idea of the spatial attention mechanism is that every word in a caption, or every question, should correspond to only one or several regions of an image. Although spatial attention based methods can uncover the subtle relationships between image and text elements, high-level semantic concepts have not been fully utilized for V2L tasks in most previous works, even though these concepts are important for humans when understanding a scene [47, 37]. In fact, semantic concepts bridge images and natural language and can contribute significantly to eliminating the well-known semantic gap. This is because semantic concepts are not only important high-level visual information for images, but also important components of captions. For example, the image in Fig. 1 shows a brown cat. The words “brown” and “cat” are high-level information of the image, and they provide very important information for understanding it. Meanwhile, the two words are also components of the corresponding captions. To eliminate the semantic gap between images and natural language, the high-level semantic information is used as an extra input of the proposed method.
Moreover, every high-level semantic concept should be correlated with a specific image region. Some previous methods, such as [37, 38], attempt to use image semantic information to complete V2L tasks: the semantic information is encoded into vectors and directly input into the language generation model. However, this process is not optimal because it cannot uncover the relationships between the semantic concepts and image regions. In this paper, a semantic-guided attention network is designed to explore these relationships. Namely, the image semantic concepts are used to attend to the corresponding regions. Other works use the attention mechanism for image semantic information, but the guidance information is natural language. In other words, the high-level semantic information is attended by the corresponding caption or question. In fact, every semantic concept is more strongly correlated with specific image regions. For instance, the image in Fig. 1 shows a brown cat. Both concept words, “cat” and “brown”, are important elements of the captions, and both correspond to the “cat” regions. So, exploring the relationships between image regions and image semantic concepts is more effective than exploring the relationships between image semantic concepts and natural language elements.
Motivated by the aforementioned two reasons, we propose a method with a semantic-guided attention network. The semantic-guided attention network contains two sub-parts, which are used to highlight the concept-related regions and the region-related concepts, respectively. In addition, the text-guided attention network is retained to explore the subtle relationships between image regions and natural language parts. For example, when describing the content of the image in Fig. 1, the phrase “brown cat” should map to the “cat” region of the image. For the VQA task, suppose the question is “What is the cat on?”. When answering this question, the “cat” and its surrounding regions should be focused on because these regions are most related to the question. So, to simultaneously learn the relationships between high-level semantic concepts and image regions and the correlations between natural language elements and image regions, we unify the two sub-attention networks (the semantic-guided attention network and the text-guided attention network) into one framework.
Fig. 1 shows the overall scheme of the proposed approach. The approach mainly includes two level attention networks. One is the text-guided attention network which is used to select text-related regions. The text-guided attention network has two variants for image captioning and VQA, respectively. In image captioning task, the text-guided attention network is called word-guided attention which is used to explore the relationships between words and image regions. In VQA task, the text-guided attention network is called question-guided attention which is used to select the image regions corresponding to the question. The other is the semantic-guided attention network which is used to dig up the relationship between image regions and high-level semantic concepts. The outputs of these two networks are projected into the same multi-modal space to generate captions or answers.
The core contributions can be summarized as follows:
1) An approach based on high-level semantic attributes and local image features is proposed to address the challenges of V2L tasks. Specifically, the high-level semantic attribute information is used to reduce the semantic gap between images and text.
2) A novel semantic-guided attention network is designed to explore the mapping relationships between semantic attributes and image regions. The semantic-guided attention network highlights the concept-related regions and selects the region-related concepts.
3) Two specific V2L tasks (i.e., image captioning and visual question answering) are addressed by the proposed approach. Taking their characteristics into account, two sub-models are designed for image captioning and VQA, respectively. Experimental results show that our models are effective for V2L tasks.
II Related Work
With the development of deep learning, recent deep learning methods have been studied for visual content analysis [11, 10, 5]. In this section, some typical methods for V2L tasks, i.e., image captioning and VQA, are introduced.
II-A Image caption generation
Using a natural language sentence to describe the content of a given image has long been studied in artificial intelligence. A traditional approach is to use predefined visual templates and generate sentences by filling them with detected visual concepts. Kuznetsova et al. pose the image caption generation task as a retrieval problem. They first retrieve a similar image and the corresponding descriptions from the training set, and then compose a new sentence based on the retrieved descriptions. Sentences generated by these methods have limited variety and cannot describe the content of a test image very well.
Recent works using deep neural networks have achieved many encouraging results on the image caption generation task. Mao et al. proposed a multi-modal recurrent neural network (m-RNN) to explore the relationships between vision and text information. At each time-step, this model predicts the next word by computing its probability distribution conditioned on the previous words and the visual features. Karpathy et al. also proposed a multi-modal RNN model to generate sentences describing the content of a given image. In contrast to m-RNN, however, the image features are input into the multi-modal RNN only at the first time-step. Vinyals et al. proposed a similar method, which combines a deep CNN for image feature extraction with an LSTM for sentence generation. Donahue et al. proposed a unified model for activity recognition, image captioning and video description. To generate captions for images, this model uses multiple layers of LSTM. Wu et al. [37, 38] proposed a caption generation model based on attributes. They use the most common words as the semantic attributes. At the sentence generation step, not only the global image feature but also the semantic attribute vector is used as an input of the RNN.
Attention-based models have become a hot topic in image caption generation. Xu et al. proposed an attention model to solve the image caption generation problem. In contrast to previous models, it uses the output of the last convolutional layer as the image features. The feature map is flattened into 196 vectors, each of which denotes one region of the image. At each time-step, only one or several regions are selected by the attention mechanism. Another work proposed an image captioning model with semantic attention. It uses a set of attribute detectors to obtain semantic concepts, and the attention mechanism selects specific items from these concepts. Fu et al. proposed a model based on spatial attention and scene-specific contexts.
II-B Visual question answering
Malinowski et al. may be the first researchers to study the “open-world” visual question answering problem. They proposed a method with two important parts: one performs semantic text parsing and the other performs image segmentation, with a Bayesian formulation used to sample from nearest neighbors in the training set. This approach depends heavily on human-defined predicates and on the accuracy of the image segmentation. Tu et al. proposed a question answering method based on a joint parse graph from text and videos. All these early approaches share a common shortcoming: the answer is limited by the form of the question.
Recently, deep neural network models have achieved many encouraging results in computer vision and natural language processing. Inspired by these results, architectures based on “CNN-RNN” have become the most popular trend. Gao et al. used a CNN to encode the image, and two separate RNNs to encode the question and generate the answer, respectively. Similarly, Malinowski et al. proposed a method based on the “CNN-RNN” architecture; however, it used only one RNN both to encode the question and to decode the image and question into the answer. Ren et al. treated visual question answering as a classification problem. Their method used an LSTM as the question encoder, and the image was treated as the first word. The answer was generated by a classifier, namely a softmax layer whose input was the output of the last time-step of the LSTM. Wu et al. proposed a method that contains two different LSTMs, one to encode the question and one to decode it together with the image information into a multi-word answer. It is worth noting that this model used the global image feature and the image attribute vector output by the attribute detector as image information. In a follow-up work, their team built a more complicated model: on top of the previous one, they added external knowledge and a caption vector as two further inputs to the encoder LSTM. They encoded five descriptions into vectors and pooled these vectors into one caption vector. Noh et al. used a CNN with dynamic parameter prediction to solve the image question answering problem. To reduce the complexity of the problem, they incorporated a hashing technique to select the weights. Another work proposed a model with CNN architectures that learn not only the image and question representations but also their inter-modal relationships to produce the answer.
A limitation of most of the aforementioned methods is that they only use the global image feature to represent the input image. This may feed irrelevant or noisy information into the answering module. To address this problem, the attention mechanism is widely used in question answering systems. A typical model is SANs, short for stacked attention networks. This model used the semantic representation of a question to search for the regions of an image that are related to the question and the answer. It stacked the attention networks because the authors argued that visual question answering needs multiple steps of reasoning. Shih et al. presented a method that learns to answer questions by selecting image regions relevant to the questions. Unlike approaches that use a one-layer neural network to compute the attention distribution, this model mapped question queries and image features from various regions into a shared space through an inner product. Xu et al. proposed a spatial memory network for the visual question answering task. Their memory networks were recurrent neural networks with an attention mechanism that chooses relevant regions stored in memory. Another work presented a dual attention model that jointly used visual and textual attention to capture the fine-grained relationship between vision and language. Lu et al. proposed co-attention for both image and question. Unlike most of the above models, this model used a hierarchical question encoding. Kazemi et al. proposed a strong baseline for visual question answering. Their model used a two-layer convolutional neural network to realize the stacked attention and produce probabilities over answer classes. Yu et al. presented a multi-level attention model that contained context-aware visual attention and semantic attention modules. The context-aware module used the question to select relevant regions, and the semantic attention module aimed to find important concepts. Xiong et al. proposed a model named dynamic memory network, which mainly contained two important parts: an input module and an episodic memory module. The core component of the input module is a bidirectional gated recurrent unit, which was used to explore the relationships between local regional image features. In fact, the episodic memory module is also an attention module, which extracts a contextual vector based on the current focus.
III Proposed Approach
Fig. 2 shows the overall framework of our approach for image captioning and VQA, two typical V2L tasks. Both sub-frameworks consist of six parts: 1) a multi-label CNN, 2) an attribute layer, 3) a bidirectional GRU module, 4) a semantic-guided attention network, 5) a text-guided attention network and 6) a joint embedding layer. The first two modules are used to extract image attributes and local features. The bidirectional GRU module is used to explore the relationships among the local image features. The concept vectors output by the attribute layer and the local image features output by the bidirectional GRU are input into the semantic-guided attention network, which is designed to highlight the concept-related regions and select the region-related concepts. The text-guided attention network explores the fine-grained mapping between language elements and image regions. All the information is fused in the joint embedding layer. Finally, the multi-modal information is used to generate the caption or the answer. There are two small differences between the two sub-frameworks. First, image captioning is treated as a generation problem, in which captions are generated word by word, while VQA is treated as a classification problem, in which all answers are processed as class labels. Second, the text-guided attention networks in image captioning and VQA are called the word-guided attention network and the question-guided attention network, respectively.
III-A Image Concepts Predicting
To train an image concept predictor, the concept vocabulary should be built first. Similar to previous work, we collect all words from the MS COCO image captioning dataset. All words are reduced to their base form (i.e., the number of nouns and the tense of verbs are ignored). To select the concept words, the word frequencies are counted first. Then the meaningless words (e.g., “a”, “is”, “on” and so on) are discarded. After this rough screening, we select the most frequent words as the image semantic concept candidates.
After constructing the concept vocabulary, we label each image with a -dimensional vector by comparing its captions with the concept vocabulary. We then train the concept detector (i.e., the multi-label CNN in Fig. 2). As a result, each image is represented as a concept vector in which each element denotes the probability of the corresponding concept. To train the concept predictor, the last softmax layer of the single-label CNN is replaced by a sigmoid cross-entropy loss layer. Suppose that there are training samples and (where ) is the attribute label of -th image. And the loss function is defined as follows:
According to , the concept set of image is defined as follows:
where denotes the indicator function, is a threshold and we set it as 0.6 in this paper, is a vector where the -th element equal to 1 and the other elements equal to 0.
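The loss and the concept-set equations did not survive in this text, so the following is only a minimal sketch of the two steps described above: a sigmoid cross-entropy loss for multi-label training, and thresholding the predicted probabilities at 0.6 to obtain the concept set. The standard binary cross-entropy form, the strict `>` comparison, the function names and the toy vocabulary are all assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_cross_entropy(logits, labels):
    """Multi-label loss: summed over concepts, averaged over images.

    logits, labels: arrays of shape (num_images, num_concepts).
    This is the textbook form; the paper's exact loss may differ.
    """
    p = sigmoid(logits)
    eps = 1e-12  # avoids log(0)
    per_elem = -(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
    return per_elem.sum(axis=1).mean()

def select_concepts(probs, vocab, threshold=0.6):
    """Keep the concepts whose predicted probability exceeds the threshold."""
    return [w for w, p in zip(vocab, probs) if p > threshold]

# Hypothetical 4-word concept vocabulary and one image.
vocab = ["cat", "brown", "dog", "grass"]
logits = np.array([[3.0, 1.0, -2.0, -0.5]])
labels = np.array([[1.0, 1.0, 0.0, 0.0]])

loss = sigmoid_cross_entropy(logits, labels)
concepts = select_concepts(sigmoid(logits[0]), vocab)
```

With these toy logits only “cat” and “brown” pass the 0.6 threshold, which mirrors the brown-cat example used in the introduction.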
III-B Local Image Feature Processing
As illustrated in , we use a pre-trained CNN (i.e., VGG-19 in this paper) to extract local image features. When a raw image is input into VGG-19, we flatten the feature map output by the CONV5-4 layer. The process can be written as follows:
where denotes the feature of -th location of image . In other words, each image is divided into locations and every represents one location. So, we call is the location feature representation.
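As a small illustration of the flattening step, the sketch below reshapes a feature map into one vector per location. The 14 x 14 x 512 shape is an assumption (it is what the CONV5-4 layer of VGG-19 produces for a 224 x 224 input, and it matches the 196 region vectors mentioned in the related work); the array contents are placeholders rather than real CNN activations.

```python
import numpy as np

# Placeholder for a CONV5-4 feature map: height x width x channels.
feature_map = np.arange(14 * 14 * 512, dtype=np.float32).reshape(14, 14, 512)

# Flatten the spatial grid so that each row is the feature of one location.
V = feature_map.reshape(-1, feature_map.shape[-1])

# Row-major flattening: location (row r, column c) becomes index r * 14 + c.
region_1_1 = V[1 * 14 + 1]
```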
The local image features extracted above do not yet contain global information. Without global information, their representational power is quite limited, because they suffer from issues such as locational variance and object scaling, which cause accuracy problems. According to previous work, a bidirectional RNN can solve this problem. Following this idea, we use a bidirectional GRU to explore the relationships among regions (as illustrated in Fig. 3). The formulas are as follows:
where and are the hidden states of forward and backward GRU at time-step , respectively. At last, the sum of and denotes the context-aware visual representation of the -th image region . We denote .
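The idea of summing the forward and backward hidden states can be sketched as follows. This is a minimal NumPy GRU with random, untrained weights; the class and function names, the dimensions and the particular gate convention (`(1 - z) * h + z * h_new`) are assumptions made for illustration, since the original formulas are not available in this text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell with random placeholder weights (not trained)."""

    def __init__(self, input_dim, hidden_dim, rng):
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.normal(0.0, 0.1, (3, hidden_dim, input_dim))
        self.U = rng.normal(0.0, 0.1, (3, hidden_dim, hidden_dim))
        self.b = np.zeros((3, hidden_dim))

    def step(self, x, h):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])
        r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])
        h_new = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])
        return (1.0 - z) * h + z * h_new

def run_gru(cell, inputs, hidden_dim):
    """Run the cell over a sequence and return all hidden states."""
    h = np.zeros(hidden_dim)
    states = []
    for x in inputs:
        h = cell.step(x, h)
        states.append(h)
    return np.stack(states)

def context_aware_regions(V, fwd, bwd, hidden_dim):
    """Sum of forward and backward hidden states for every region."""
    h_fwd = run_gru(fwd, V, hidden_dim)
    # Process the regions in reverse order, then restore the original order.
    h_bwd = run_gru(bwd, V[::-1], hidden_dim)[::-1]
    return h_fwd + h_bwd

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 8))  # 6 toy regions with 8-dimensional features
fwd, bwd = GRUCell(8, 4, rng), GRUCell(8, 4, rng)
context = context_aware_regions(V, fwd, bwd, 4)
```

Each row of `context` depends on all regions: the forward state has seen the regions before it and the backward state has seen the regions after it.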
III-C Semantic-guided Attention
To find the fine-grained relationship between image regions and semantic concepts, we propose a semantic-attention network. We connect the visual representation and semantic concepts representation by similarity between them at all image-concepts and concept-regions. Specifically, given an image representation , and the concepts representation , the similarity matrices are calculated as follows:
where and represent the similarity matrices between image regions and concepts. Concretely, and are scores which represent the similarity of the -th concept with -th region representation and the similarity of the -th region representation with -th concept, respectively. , , , , , , , and denote weights parameters. Note that “” represents the addition of a matrix and a vector. The addition between a matrix and a vector is performed by adding each column of the matrix by the vector.
After calculating the similarity matrices, the formula of attention weights is as follows:
Based on the above attention weights, the concept-based image region representation and the region-based concept representation are calculated as follows:
After computing the weighted image and concept representations, we concatenate them into a vector which contains the image feature and semantic concepts representation. The formula is:
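Because the similarity and attention equations are missing from this text, the sketch below only illustrates the general pattern of this section: compute a region-concept similarity matrix, turn it into attention weights over regions and over concepts with a softmax, form the two weighted representations, and concatenate them. The bilinear-style similarity in a shared space, the mean pooling of the similarity matrix, and all names and dimensions are assumptions, not the paper's parameterization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def semantic_guided_attention(V, C, W_v, W_c):
    """V: (m, d_v) region features, C: (n, d_c) concept representations."""
    P_v = np.tanh(V @ W_v)  # regions projected to a shared space, (m, k)
    P_c = np.tanh(C @ W_c)  # concepts projected to the same space, (n, k)
    S = P_v @ P_c.T         # S[i, j]: similarity of region i and concept j

    alpha = softmax(S.mean(axis=1))  # attention over regions
    beta = softmax(S.mean(axis=0))   # attention over concepts

    v_att = alpha @ V  # concept-based image region representation
    c_att = beta @ C   # region-based concept representation
    return np.concatenate([v_att, c_att]), alpha, beta

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 8))  # 5 toy regions
C = rng.normal(size=(3, 6))  # 3 toy concepts
W_v = rng.normal(size=(8, 4))
W_c = rng.normal(size=(6, 4))
fused, alpha, beta = semantic_guided_attention(V, C, W_v, W_c)
```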
III-D Image Captioning
The proposed model for image captioning is summarized in Fig. 2 (a). Similar to , our language generation model is trained by maximizing the probability of the correct description conditioned on the given image. Combined with our model, the log-likelihood function can be written as follows:
where is the description of image , is the -th word of the sentence , and is the length of the sentence. Based on Eq. (9), the probability of generating the word (i.e., ) is determined by the output of the semantic-guided attention network and the previous words . We exploit GRU to model this.
III-D1 Sentence Representation
In our model, we encode words into one-hot vectors. For example, the benchmark dataset has different words, and every word is encoded into a -dimension vector in which only one value equals to 1 and others equal to 0. When a raw image is input into our model, a corresponding sentence is generated which is encoded as a sequence of one-hot vectors. We denote , where represents the -th word in the sentence. We project these words into embedding space. The concrete formula is as follows:
where is the embedding matrix of sentences which projects the word vector into the embedding space. So the projection matrix is a matrix where is the size of the dictionary and is the dimensionality of the embedding space.
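The projection of one-hot word vectors into the embedding space can be sketched as below. The toy vocabulary, the embedding dimension of 3 and the row-vector convention (dictionary size by embedding dimension) are assumptions for illustration. The sketch also shows that multiplying a one-hot vector by the embedding matrix is equivalent to selecting one row of that matrix.

```python
import numpy as np

vocab = ["<start>", "a", "brown", "cat", "<end>"]  # hypothetical dictionary
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 3))  # embedding matrix: dictionary size x embedding dim

sentence = ["a", "brown", "cat"]
X = np.stack([one_hot(word_to_id[w], len(vocab)) for w in sentence])
embedded = X @ E  # each row is the embedding of one word

# Equivalent and cheaper in practice: index the rows directly.
looked_up = E[[word_to_id[w] for w in sentence]]
```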
III-D2 Word-guided Attention
Similar to previous work, we use a word-guided attention mechanism for the local features. At each time-step, the attention mechanism uses the previous hidden state, which summarizes the information of the previous words, to decide which local features to attend to. The attention model is defined as follows:
where and are weights, is bias. is a probability vector whose each value denotes the probability of the corresponding local image feature. In our algorithm, we use the soft attention model. Therefore, , the word-related region representation at time-step , is calculated as follows:
Through Eq. (12) we know that decides which locals should be emphasized at the current time-step.
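A soft attention step of this kind can be sketched as follows. The additive scoring form with a tanh layer is the common choice in soft-attention captioning models and is assumed here because the original equations are missing; the parameter names (`W_v`, `W_h`, `w`, `b`) and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def word_guided_attention(V, h_prev, W_v, W_h, w, b):
    """V: (m, d) region features, h_prev: (n,) previous hidden state."""
    # One score per region, conditioned on the previous hidden state.
    scores = np.tanh(V @ W_v + h_prev @ W_h + b) @ w
    alpha = softmax(scores)  # probability of attending to each region
    z = alpha @ V            # soft attention: weighted sum of regions
    return z, alpha

rng = np.random.default_rng(0)
m, d, n, k = 5, 8, 6, 4
V = rng.normal(size=(m, d))
h_prev = rng.normal(size=n)
z, alpha = word_guided_attention(
    V, h_prev,
    rng.normal(size=(d, k)), rng.normal(size=(n, k)),
    rng.normal(size=k), np.zeros(k),
)
```

Because the weights are non-negative and sum to one, the attended vector is a convex combination of the region features, so regions with larger weights dominate it.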
III-D3 Gate for
To control when and how much should be input into sentence generation GRU, we design a gate to achieve it. The gate is defined as follow:
where is weight vector and is bias. After calculating the gate, the is controlled as following formula:
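The gating step can be sketched as a scalar sigmoid gate that scales the attended region vector. The text only states that the gate has a weight vector and a bias; computing it from the previous hidden state, as done here, is an assumption, as are the names `w_g` and `b_g`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_region_vector(z, h_prev, w_g, b_g):
    """Scale the attended region vector z by a scalar gate in (0, 1)."""
    g = sigmoid(w_g @ h_prev + b_g)
    return g * z, g

z = np.array([2.0, -4.0, 6.0])   # toy attended region vector
h_prev = np.array([0.3, -0.1])   # toy previous hidden state

# With zero weights and bias the gate is exactly 0.5 (half of z passes).
gated, g = gate_region_vector(z, h_prev, np.zeros(2), 0.0)
```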
III-D4 Sentence Generating
After getting and , we use GRU to generate description for the given image. The formulae are as follows:
The loss function is written as follows:
where is the number of training images and is the length of the sentence for the -th training image. equals to in Eq. (16).
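The training objective described here, the negative log-likelihood of the reference words summed over time-steps, can be sketched as below. Averaging over the training images and the absence of any regularization term are assumptions, since the loss equation itself is missing from this text; the function names are hypothetical.

```python
import numpy as np

def caption_nll(step_probs, target_ids):
    """Negative log-likelihood of one caption.

    step_probs: (T, vocab_size) predicted word distributions per time-step.
    target_ids: the T reference word indices.
    """
    return -sum(np.log(p[t]) for p, t in zip(step_probs, target_ids))

def dataset_loss(batch):
    """Average caption loss over a list of (step_probs, target_ids) pairs."""
    return sum(caption_nll(p, t) for p, t in batch) / len(batch)

# One toy caption of length 2 over a 3-word vocabulary.
probs = np.array([[0.5, 0.25, 0.25],
                  [0.25, 0.5, 0.25]])
targets = [0, 2]
loss = dataset_loss([(probs, targets)])  # -(log 0.5 + log 0.25) = log 8
```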
III-E Visual Question Answering
The model for VQA is illustrated in Fig. 2 (b). Similar to , we take VQA as a classification task. So, all the information should be jointly embedded into a classifier. Given the image and corresponding question , we expect the probability of the correct answer to reach maximum. The object function can be written as follows:
where denotes the representation of question , represents the image feature and attribute representation of image . After encoding each question into a vector , we calculate the question-guided region representation.
III-E1 Question-guided Attention Network
Similar to word-guided attention in Section III-D2, we design a question-guided attention network to select the question-related regions which can improve the accuracy of the answer. The formula of the attention weight is as follows:
where is the inner product operation symbol, and are projection matrices which project the question and region representation into the -dimensional multi-modal space. After that, the question-related region can be represented as follow:
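The question-guided attention can be sketched as an inner product between the projected question and each projected region, followed by a softmax. The projection matrices are named `W_q` and `W_v` here for illustration; any scaling or bias terms in the original formula are unknown and therefore omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def question_guided_attention(V, q, W_v, W_q):
    """V: (m, d_v) region features, q: (d_q,) question vector."""
    regions = V @ W_v           # regions in the shared space, (m, k)
    question = q @ W_q          # question in the shared space, (k,)
    scores = regions @ question # inner product per region
    alpha = softmax(scores)
    return alpha @ V, alpha     # question-related region representation

# Toy case: identity projections, question aligned with the first region.
V = np.array([[1.0, 0.0],
              [0.0, 1.0]])
q = np.array([10.0, 0.0])
v_q, alpha = question_guided_attention(V, q, np.eye(2), np.eye(2))
```

In this toy case almost all attention goes to the first region, because it is the only one with a large inner product with the question.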
III-E2 Joint Embedding
Finally, we feed all the vectors (i.e., and ) into a classifier with a joint embedding layer to generate the answer. This can be represented by the following formulae:
where , and are the parameters of the last parameter layer, the input of the classifier, and are calculated in Eq. (8) and Eq. (20), respectively, is the distribution of probability of answer candidates. The answer is the maximum probability of the candidates.
The loss function can be written as follows:
where is the number of the train examples.
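Treating VQA as classification can be sketched as follows: the fused vectors are concatenated, passed through a linear layer and a softmax over the answer candidates, and trained with cross-entropy. A single linear layer, the names `W` and `b`, and the mean over training examples are assumptions, since the original equations are not available in this text.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_distribution(u, v_q, W, b):
    """u: semantic-guided attention output, v_q: question-related regions."""
    x = np.concatenate([u, v_q])  # joint input of the classifier
    return softmax(W @ x + b)     # probabilities over answer candidates

def vqa_loss(prob_rows, answer_ids):
    """Mean cross-entropy of the correct answers."""
    return -np.mean([np.log(p[a]) for p, a in zip(prob_rows, answer_ids)])

rng = np.random.default_rng(0)
u, v_q = rng.normal(size=6), rng.normal(size=4)
num_answers = 5
W, b = rng.normal(size=(num_answers, 10)), np.zeros(num_answers)

p = answer_distribution(u, v_q, W, b)
predicted_answer = int(np.argmax(p))  # the candidate with maximum probability
loss = vqa_loss([p], [predicted_answer])
```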
IV-A Train Details and Experimental Setup
This section describes the training details and the parameter settings. For both image captioning and VQA tasks, the variants of our models are trained with stochastic gradient descent with adaptive learning rates. Specifically, for Flickr30K, MS COCO, VQA and COCO-QA, the Adam algorithm is used. For Flickr8K [13, 14], RMSProp is used to train the models. The parameter settings are given in the following subsections.
IV-A1 Local Image Feature
In the proposed model, deep features generated from the CONV5-4 layer of VGG-19 are used to represent the images. The dimensionality of the feature map output from the Conv5-4 layer is . Through flattening operation, the feature map is transformed into . So, in Section III-B, the parameters and .
IV-A2 Image Concepts Encoding
In the proposed method, concept words are collected from the MS COCO image captioning dataset. So, each image is encoded into a -dimensional vector. In other words, in Section III-A.
IV-A3 Word Encoding
In our model, we encode words into one-hot vectors. For example, the benchmark dataset has different words, every word is encoded into a -dimension vector, in which only one value is equal to 1 and others are equal to 0. So the location of 1 in the vector denotes the corresponding word in the dictionary. It implies that in Section III-D1 equals . Specifically, after filtering out words that occur fewer than 5 times in the training set, the dictionary contains 2538, 7414 and 8791 words for Flickr8K, Flickr30K and MS COCO, respectively.
IV-A4 Question Encoding
To encode the questions, we first cast all question words which appear at least twice in the training and validation sets into lowercase. After collecting the question word vocabulary, each word is represented as a one-hot vector. We use a one-layer Gated Recurrent Unit (GRU) with a -dimensional hidden state to encode the question, and take the last hidden state of the GRU as the question representation. So the parameter in Section III-E.
IV-A5 Other Parameters
IV-B Image Captioning
IV-B1 Dataset and Evaluation Metrics
Dataset. We report results on the three most popular datasets: Flickr8K, Flickr30K and MS COCO. Flickr8K and Flickr30K have 8,092 and 31,783 images, respectively, and each image has 5 reference sentences. The MS COCO dataset has 123,287 images, and most images have 5 reference sentences. Before the experiments, we preprocess the datasets as in previous work. First, we convert all letters in the sentences to lowercase, remove non-alphanumeric characters and discard words that occur fewer than five times in the training set. Second, we discard the data that have more than 5 corresponding sentences to guarantee that every image has the same number of descriptive sentences. For MS COCO, we evaluate our model with the widely used publicly available splits.
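The text preprocessing steps (lowercasing, removing non-alphanumeric characters, dropping rare words) can be sketched as below. The function names are hypothetical, and the toy example uses a threshold of 2 instead of 5 only so that the effect is visible on three sentences. Whether rare words are dropped or mapped to a special token is not stated in this text; dropping is assumed.

```python
import re
from collections import Counter

def normalize(sentence):
    """Lowercase, strip non-alphanumeric characters, split on whitespace."""
    return re.sub(r"[^a-z0-9 ]", "", sentence.lower()).split()

def build_vocab(train_sentences, min_count=5):
    """Keep the words that occur at least min_count times in the training set."""
    counts = Counter(w for s in train_sentences for w in normalize(s))
    return {w for w, c in counts.items() if c >= min_count}

def preprocess(sentence, vocab):
    """Normalize a sentence and drop out-of-vocabulary (rare) words."""
    return [w for w in normalize(sentence) if w in vocab]

train = ["A brown cat.", "A cat sleeps!", "The cat, a dog."]
vocab = build_vocab(train, min_count=2)   # toy threshold for the tiny corpus
tokens = preprocess("A Brown CAT!", vocab)
```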
Evaluation Metrics. We report results with the BLEU, METEOR and CIDEr metrics, which are the most frequently used in the caption generation literature. The first two metrics were originally designed for evaluating the quality of machine translation. The BLEU score represents the precision of the generated sentence compared with the reference sentences. The METEOR score reflects both the precision and the recall of the generated sentence; it is based on the harmonic mean of unigram precision and recall. CIDEr measures consistency between n-gram occurrences in generated and reference sentences, where this consistency is weighted by n-gram saliency and rarity. For BLEU, we report the scores from BLEU-1 to BLEU-4, which denote the precision of N-grams (N equal to 1, 2, 3 and 4). For all metrics, a higher score indicates a higher quality of the generated sentences.
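To make the n-gram precision behind BLEU concrete, the sketch below computes the clipped (modified) n-gram precision of a candidate against one or more references. It is a simplified illustration with hypothetical function names: the full BLEU score additionally combines the precisions for several n with a geometric mean and applies a brevity penalty, both of which are omitted here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against the references."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    # For each n-gram, the maximum count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

candidate = "a brown cat sits on the mat".split()
references = ["a brown cat is on the mat".split()]
p1 = modified_precision(candidate, references, 1)  # 6 of 7 unigrams match
p2 = modified_precision(candidate, references, 2)  # 4 of 6 bigrams match
```

Clipping prevents a degenerate candidate from scoring well by repeating a common word: each n-gram is credited at most as often as it appears in a reference.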
IV-B2 Results on Flickr8K and Flickr30K
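|Model||Flickr8K B-1||Flickr8K B-2||Flickr8K B-3||Flickr8K B-4||Flickr8K METEOR||Flickr8K CIDEr||Flickr30K B-1||Flickr30K B-2||Flickr30K B-3||Flickr30K B-4||Flickr30K METEOR||Flickr30K CIDEr|
|Att-SVM + LSTM||73||53||38||26||-||-||68||49||33||23||-||-|
|Att-GlobalCNN + LSTM||72||53||38||27||-||-||70||50||35||27||-||-|
|RA + SS||61.3||43.0||29.6||19.8||19.5||48.9||63.5||44.7||31.1||21.4||19.2||44.8|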
We compare our method with several state-of-the-art methods on the Flickr8K and Flickr30K datasets. The compared models can be roughly divided into three categories. The first category, such as NeuralTalk, Google-NIC and m-RNN in Table I, only uses the global image feature extracted by a CNN, and only this feature is input into the sentence generator RNN. The second category is attribute-based models, in which the global image feature and the attribute vector are used for sentence generation. In Table I, Att-SVM + LSTM and Att-GlobalCNN + LSTM belong to this category. The third category is attention-based models, which try to explore the relationship between image regions and words. NIC-VA, ATT and RA, among others, in Table I are all attention-based models. Table I reports the image captioning results on Flickr8K and Flickr30K. Among the compared models, the attribute-based models show better performance than the attention-based models. Especially on Flickr8K, Att-GlobalCNN + LSTM brings significant improvements on B-1, B-2, B-3 and B-4 on average, and similar improvements are observed on the Flickr30K dataset. This implies that the high-level semantic information (i.e., attributes) is very important for the image captioning task. Compared with the basic models (i.e., those using neither attributes nor attention), the attention-based models show much better performance for image captioning. The main reason is that the attention-based models can uncover the relationship between image regions and sentence elements.
Although the state-of-the-art attribute-based models and attention-based models show good performance on the Flickr8K and Flickr30K datasets, Table I shows that our model gains a much better results on these datasets (only the B-1 score less than Att-SVM + LSTM on the Flickr8K dataset). The main reason is our model combines the semantic information and attention mechanism masterly. Specially, both the semantic information vector output from the semantic-guided attention network and the image region feature are selected by the word-guided attention network are used to generate the description sentence.
IV-B3 Results on MS COCO
|Model||B-1||B-2||B-3||B-4||METEOR|
|Att-SVM + LSTM ||69||52||38||28||23|
|Att-CNN + LSTM ||74||56||42||31||26|
Table II shows the image captioning results on the MS COCO dataset. As in the experiments on Flickr8K and Flickr30K, the compared models can be classified into three categories: models with neither attention nor attributes (such as NeuralTalk, Google-NIC, LRCN and m-RNN), attribute-only models (such as Att-CNN + LSTM) and attention-only models (such as NIC-VA and ATT-FCN). Among the compared models, Att-CNN + LSTM obtains the highest scores on both the BLEU and METEOR metrics. This shows that high-level semantic information is important for transforming an image into a natural language sentence; that is, it contributes significantly to reducing the semantic gap between vision and language. Unlike the model proposed in this paper, Att-CNN + LSTM uses no attention mechanism: the attribute information is encoded into a single vector and fed into the language model. Table II shows that our model obtains higher scores on most of the metrics, which implies that our model with the attention mechanism is more effective than Att-CNN + LSTM. In addition, the results of the attention-based models are clearly better than those of the models that use neither attributes nor attention. This indicates that the attention mechanism can find fine-grained relationships between image regions and sentence elements, and that these relationships are useful for image captioning.
Although the attention-based and attribute-based models show strong ability on the image captioning task, our model further improves the performance. There are two main reasons: 1) the word-guided attention network can find the fine-grained relationship between image regions and words; 2) the semantic-guided attention network adds high-level semantic information, which helps to reduce the semantic gap between vision and language.
Fig. 4 visualizes the generated captions, attributes and image attention maps on the MS COCO dataset. From Fig. 4, we see that our model successfully learns to align local image regions, image attributes and words. For instance, when generating the caption for the first image in the third row, the attribute layer predicts four attributes (i.e., “man”, “next”, “motorcycle” and “build”) for this image. When generating the word “riding”, the proposed model attends to the most related region (i.e., the region of the image containing the man). The histogram shows the attention weights of the four attributes, which are computed by Eq. (7). These instances show that the model can explore the relationships among attributes, local image regions and captions very well.
IV-B4 Ablation Study
To verify the effectiveness of each component in our model, we perform ablation studies by ablating certain components:
No attention (None-Att). The word-guided and semantic-guided attention networks are removed. Only the global image feature output from the FC7 layer of VGG-19 is used to generate descriptions.
Word-guided attention (WA). Only the word-guided attention network is used.
Word and semantic-guided attention (WSA). Both attention networks are used, while the gate is removed.
Word and semantic-guided attention with gating controlling (FULL). Our full model.
Table III shows the performance of the ablation models. The results confirm that both the word-guided and the semantic-guided attention networks are effective. First, the WA model improves on the None-Att model. That is because the word-guided attention network can automatically focus on the most word-related regions. Second, by introducing the concept information and the semantic-guided attention network, the WSA model further improves on the WA model. The concept information is an important supplement to the image information, and the semantic-guided attention network explores the relationship between the concepts and the regions. Last, our full model, which adds a gate to control the vector output from the semantic-guided attention model, is an improved version of the WSA model. As can be seen in Table III, the gate is necessary, because it automatically controls whether and how much of the concept-region representation should be input to the RNN module at each time step.
Fig. 5 shows some examples of image captioning on the validation set of the MS COCO dataset. Generally speaking, our full model shows the best performance among the ablation models. The None-Att model may lose attribute information and some important object information when generating a description for an image. For example, for the second instance of the first row in Fig. 5, the None-Att model can describe the main objects in the scene (such as “cat” and “laptop”), but the attribute of the “cat” (i.e., its color, yellow) and some important objects (such as “desk” and “textbook”) are not described. Two main reasons may lead to such a result: 1) the attribute information (i.e., the color of the cat) is not used to generate the caption; and 2) the None-Att model does not have the word-guided attention network, which can exploit the attention-transfer mechanism. In other words, the None-Att model only focuses on the main region (the “cat” and “laptop” region) and describes it, but it cannot transfer its attention to other regions (such as the “desk” and “textbook” regions). The WA model tries to find all important objects in an image, but it may make some mistakes. For instance, there is no “book” in the second image of the third row in Fig. 5, but the WA model identifies some object as a “book”. At the same time, the “posters” are missed. The WSA model, however, does express the information about the “posters”. This is mainly because “poster” is an attribute word and the WSA model has the semantic-guided attention network, which can make full use of the attribute information. Our FULL model not only considers both the semantic information and the relationship between words and image regions, but also uses a gate to control when and how much of the semantic information output from the semantic-guided attention network should be used to generate the description. This structure can correct some mistakes made by the WSA model.
For example, when the WSA model describes the first image of the third row in Fig. 5, the phrase “coffee table” is generated twice, but our FULL model corrects this mistake and generates the right caption (“dining-room”).
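The role of the gate in the FULL model can be illustrated with a small sketch. The exact gate equation is not reproduced in this section, so the code below assumes a generic element-wise sigmoid gate computed from the RNN hidden state and the semantic vector; the function name `gated_semantic_input`, the weight shapes and the random inputs are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_semantic_input(hidden, semantic, w_gate, b_gate):
    """Return (gated semantic vector, gate values), one gate value per semantic dimension."""
    joint = hidden + semantic  # list concatenation of the two vectors
    gate = [sigmoid(sum(w * x for w, x in zip(row, joint)) + b)
            for row, b in zip(w_gate, b_gate)]
    # A gate value near 0 suppresses a dimension, a value near 1 passes it through.
    gated = [g * s for g, s in zip(gate, semantic)]
    return gated, gate


# Hypothetical sizes and random values, only for demonstration.
random.seed(0)
hidden = [random.gauss(0.0, 1.0) for _ in range(8)]
semantic = [random.gauss(0.0, 1.0) for _ in range(4)]
w_gate = [[random.gauss(0.0, 0.5) for _ in range(12)] for _ in range(4)]
b_gate = [0.0] * 4
gated, gate = gated_semantic_input(hidden, semantic, w_gate, b_gate)
```

Under this assumed form, removing the gate (as in the WSA variant) corresponds to fixing every gate value to 1, so the semantic vector is always passed to the RNN unchanged.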
IV-C Visual Question Answering
|Model||Object||Number||Color||Location||Accuracy||WUPS@0.9||WUPS@0.0|
|CoATT + VGG ||65.6||49.6||61.5||56.8||63.3||73.0||91.3|
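|Model||test-dev Yes/No||test-dev Number||test-dev Other||test-dev All||test-std Yes/No||test-std Number||test-std Other||test-std All|
|LSTM Q + I ||78.9||35.2||36.4||53.7||79.0||35.6||36.8||54.1|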
|SMem-VQA Two-Hop ||80.87||37.32||43.12||57.99||80.8||37.53||43.48||58.24|
|DAN (VGG) ||82.1||38.2||50.2||62.0||-||-||-||-|
|DAN (ResNet) ||83.0||39.1||53.9||64.3||82.8||38.1||54.0||64.2|
|MLAN (ResNet) ||82.9||39.2||52.8||63.7||-||-||-||-|
|MLAN (ResNet, train + val) ||83.8||40.2||53.7||64.6||83.7||40.9||53.7||64.8|
|MLAN (ResNet, train + val + VG) ||81.8||41.2||56.7||65.3||81.3||41.9||56.5||65.2|
IV-C1 Dataset and Evaluation Metrics
Dataset. We report VQA results on the Toronto COCO-QA and VQA datasets, which are the most popular publicly available visual question answering datasets based on MS COCO. The Toronto COCO-QA dataset contains 8,000 images with 79,000 question/answer pairs for training and 4,000 images with 39,171 question/answer pairs for testing. The questions are of four types (i.e., object, number, color and location), and all answers are single words. The VQA dataset is a much larger dataset which contains 614,163 questions. Its training and testing split follows the official COCO split, with 82,783 training images, 40,504 validation images and 81,434 test images; each image has 3 questions and each question has 10 answers. We use the official test split for testing. The dataset has two different tasks: open-ended and multiple-choice. We only report results on the open-ended task.
Evaluation Metrics. We formulate VQA as a classification problem, and the proposed model is evaluated with classification accuracy. The WUPS score is also reported. WUPS measures the similarity between two words based on the depth of their lowest common ancestor in the taxonomy.
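The two VQA metrics can be sketched for single-word answers as follows. This is a simplified illustration under stated assumptions: the tiny `PARENT` taxonomy is hypothetical and stands in for WordNet, the Wu-Palmer similarity is computed as 2·depth(lca) / (depth(a) + depth(b)) with the root at depth 1, and scores below the threshold are down-weighted by a factor of 0.1, as is usual for WUPS@0.9 and WUPS@0.0.

```python
# Toy taxonomy (child -> parent); a real evaluation would use WordNet instead.
PARENT = {
    "animal": "entity", "object": "entity",
    "cat": "animal", "dog": "animal",
    "table": "object",
}


def path_to_root(word):
    """Return the path from a word up to the root, e.g. ["cat", "animal", "entity"]."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def wup(a, b):
    """Wu-Palmer similarity based on the depth of the lowest common ancestor."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_a = set(path_a)
    lca = next(node for node in path_b if node in ancestors_a)
    return 2.0 * len(path_to_root(lca)) / (len(path_a) + len(path_b))


def accuracy(predictions, answers):
    """Fraction of predictions that match the ground-truth answer exactly."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def wups(predictions, answers, threshold):
    """Mean thresholded WUP; scores below the threshold are scaled by 0.1."""
    total = 0.0
    for pred, ans in zip(predictions, answers):
        score = wup(pred, ans)
        total += score if score >= threshold else 0.1 * score
    return total / len(answers)


predictions = ["cat", "dog", "table"]
answers = ["cat", "cat", "cat"]
acc = accuracy(predictions, answers)
wups_09 = wups(predictions, answers, 0.9)
wups_00 = wups(predictions, answers, 0.0)
```

In this toy case the strict accuracy only rewards the exact match, WUPS@0.9 gives small partial credit to the near misses, and WUPS@0.0 gives full Wu-Palmer credit, which mirrors why the WUPS@0.0 column is always the largest of the three.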
IV-C2 Results on COCO-QA Dataset
Table IV shows the results on the COCO-QA dataset. We categorize the compared models as i) models with neither attributes nor attention, ii) attribute-only models and iii) attention-only models (the categories are separated by double horizontal lines in Table IV). From Table IV, we can conclude that both the attention-based models and the attribute-based models significantly improve accuracy on all four types of questions on the COCO-QA dataset. The Att-LSTM model performs better on the COCO-QA dataset than both the models with neither attributes nor attention and the attention-based models (except the CoATT + VGG model, which includes both image attention and question attention). This confirms that high-level semantic information is important for solving the VQA problem. In addition, the results of the attention-based models are clearly better than those of the models with neither attributes nor attention. This is mainly because the attention model can focus on the important regions that are highly correlated with the question.
From the results in Table IV, we find that our model improves on the previous state of the art (CoATT + VGG). Across the different types of questions, all the models in Table IV perform worse on the Number and Location questions than on the Object and Color questions. This is mainly caused by unbalanced data: the Object, Number, Color and Location questions are unevenly represented in the COCO-QA dataset. However, our model and the Att-LSTM model increase the accuracy on the Number questions much more than the attention-based models do. Furthermore, the proposed method further improves the performance on the Object questions. There are two main reasons for this result: 1) the question-guided attention network focuses on the important regions related to the question; 2) the semantic-guided attention network provides the attribute information, and object nouns are the main elements of the attribute set. All the results show that the proposed method outperforms almost all the compared models on all types of questions.
IV-C3 Results on VQA Dataset
We compare our method with several state-of-the-art methods on the VQA dataset. For a fair comparison, the per-answer-category accuracy and the overall accuracy of the models are shown in Table V. The compared models are divided into four categories: i) models with neither attributes nor attention, ii) attribute-only models, iii) attention-only models and iv) models that use both semantic-guided and visual attention (the categories are separated by double horizontal lines in Table V). It is observed that our model achieves nearly the best performance on all question categories. Only the MLAN model achieves results comparable to ours. There are two reasons for this. First, the MLAN model uses two levels of attention networks: the visual attention, which benefits fine-grained spatial inference, and the semantic attention, which reduces the semantic gap. Second, the MLAN model uses ResNet as the image feature extractor, which is more powerful than VGG-Net. Nevertheless, our model is not inferior to it. To make the comparison fairer (i.e., to eliminate the influence of the visual features), ResNet-152 is also used as the image feature extractor (Ours (ResNet) in Table V). Specifically, the feature maps output from the last convolutional layer of ResNet-152 are used as visual features. Thus each image is divided into 49 regions, and each region is represented as a 2048-dimensional vector. Compared with MLAN, our model obtains the highest scores for almost all types of questions. This shows that, with the same image feature representation, our model performs better than MLAN.
Furthermore, there are two differences between MLAN and our approach: (a) our approach is designed for both image captioning and visual question answering, while MLAN is only used for VQA; (b) the semantic-guided attention in our approach is a dual structure, i.e., the semantic-guided attention network is used not only to find the most attribute-related image regions, but also to find the most region-related attributes, whereas the semantic attention network in MLAN is only used to find the attributes corresponding to the question. The experimental results shown in Table V demonstrate that the dual structure of the semantic-guided attention is effective for the VQA task. All results in Tables IV and V confirm that the semantic information and the attention network are vital to the VQA task.
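The region representation described above (a 7 × 7 grid of convolutional features treated as 49 region vectors) and a soft attention over those regions can be sketched as follows. This is only an illustration under assumptions: the scoring function is a plain dot product with a query vector, which is a common simplification rather than the attention equations of the proposed model, and 4 channels stand in for the 2048 channels of ResNet-152 to keep the example small. All names are hypothetical.

```python
import math


def flatten_regions(feature_map):
    """Turn an H x W grid of C-dim vectors into a list of H*W region vectors."""
    return [vec for row in feature_map for vec in row]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(regions, query):
    """Dot-product attention: weight each region by its similarity to the query."""
    scores = [sum(r * q for r, q in zip(region, query)) for region in regions]
    weights = softmax(scores)
    context = [sum(w * region[d] for w, region in zip(weights, regions))
               for d in range(len(query))]
    return context, weights


# 7 x 7 grid as in the text; C = 4 is a stand-in for 2048 channels.
H, W, C = 7, 7, 4
feature_map = [[[float((i * W + j + c) % 5) for c in range(C)]
                for j in range(W)] for i in range(H)]
regions = flatten_regions(feature_map)
query = [0.5, -0.25, 0.0, 1.0]  # e.g. an encoded question (hypothetical values)
context, weights = attend(regions, query)
```

The 49 weights sum to one, so the context vector is a convex combination of the region vectors; regions that score higher against the question encoding contribute more to it.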
Fig. 6 shows some typical examples from the VQA validation subset. The letter Q denotes the corresponding questions, while “Ours” and “LSTM” denote the answers generated by our model and by the VIS+LSTM model, respectively. From the instances shown in Fig. 6, we find that our model performs much better than VIS+LSTM, especially on the number questions. For example, in the second instance of the second row, there are two cats in the photo, but VIS+LSTM only finds one. This is mainly because the proposed model contains two levels of attention networks, which help the model focus on the regions most related to the question. Moreover, when answering object-attribute questions, the proposed model gives more accurate answers than VIS+LSTM. For example, the fourth image in the first row shows a big football field. The football field is obviously green, yet VIS+LSTM judges the grass to be white. These examples show that our model is effective for the VQA task.
|Model||Object||Number||Color||Location||Accuracy||WUPS@0.9||WUPS@0.0|
IV-C4 Ablation Study
To further verify the effectiveness of each component in our model, we perform ablation studies by ablating certain components:
None-attention (None-Att). Only the image representation and the question representation are used for the VQA problem, without the semantic-guided attention network and the question-guided attention network.
Question-guided attention (QA). Only the question-guided attention network is used.
Semantic-guided attention (SA). Only the semantic-guided attention network is used.
Question and semantic-guided attention (FULL). All attention networks are used for VQA.
Table VI shows the performance of the ablation models on the COCO-QA dataset. The results are similar to those in Table III, so we reach a similar conclusion: 1) the question-guided attention network is effective in finding question-related regions, and 2) the semantic-guided attention network provides important supplementary information which helps the model obtain the correct answers.
We propose a novel model based on attributes and the attention mechanism for V2L problems. The model includes two levels of attention networks. The text-guided attention network enables a fine-grained understanding between vision and language, and the semantic-guided attention network provides high-level concept information and explores the subtle relationships between concepts and regions, which reduces the gap between language and visual information. Our model makes full use of the complementarity of the visual representations at different levels. Extensive experiments on both image captioning and visual question answering show that our model outperforms any single visual attention or attribute model. The semantic attention network is an important supplement to the text-guided attention.
-  (2015) VQA: visual question answering. In Proceedings of IEEE International Conference on Computer Vision, pp. 2425–2433. Cited by: §IV-A, TABLE V.
-  (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Vol. 29, pp. 65–72. Cited by: §IV-B1.
-  (2017) Hierarchical boundary-aware neural encoder for video captioning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1657–1666. Cited by: §I.
-  (2015) ABC-CNN: an attention based convolutional neural network for visual question answering. CoRR abs/1511.05960. Cited by: TABLE IV.
-  (2019) Learning rotation-invariant and fisher discriminative convolutional neural networks for object detection. IEEE Transactions on Image Processing 28 (1), pp. 265–278. Cited by: §II.
-  (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734. Cited by: §III.
-  (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634. Cited by: §I, §II-A, TABLE II.
-  (2013) Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 2321–2334. Cited by: §II-A, §IV-B2, TABLE I, TABLE II.
-  (2015) Are you talking to a machine? dataset and methods for multilingual image question. In Proceedings of Advances in Neural Information Processing Systems, pp. 2296–2304. Cited by: §II-B.
-  (2018) CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion. IEEE Transactions on Cybernetics 48 (11), pp. 3171–3183. Cited by: §II.
-  (2018) A unified metric learning-based framework for co-saliency detection. IEEE Transactions on Circuits and Systems for Video Technology 28 (10), pp. 2473–2483. Cited by: §II.
-  (2016) Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §IV-C3.
-  (2013) Framing image description as a ranking task: data, models and evaluation metrics. Journal of Artificial Intelligence Research 47, pp. 853–899. Cited by: §IV-A.
-  (2015) Framing image description as a ranking task: data, models and evaluation metrics (extended abstract). In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 4188–4192. Cited by: §IV-A.
-  (2017) TGIF-QA: toward spatio-temporal reasoning in visual question answering. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2758–2766. Cited by: §I.
-  (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137. Cited by: §I, §II-A, §IV-B1, §IV-B2, TABLE I, TABLE II.
-  (2017) Show, ask, attend, and answer: A strong baseline for visual question answering. CoRR abs/1704.03162. Cited by: §II-B.
-  (2013) BabyTalk: understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 2891–2903. Cited by: §II-A.
-  (2017) MAM-RNN: multi-level attention model based RNN for video captioning. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 2208–2214. Cited by: §I.
-  (2014) Microsoft coco: common objects in context. In Proceedings of European Conference on Computer Vision, pp. 740–755. Cited by: §III-A, §IV-A.
-  (2017) Attention correctness in neural image captioning. In Proceedings of AAAI Conference on Artificial Intelligence, pp. 4176–4182. Cited by: §I.
-  (2016) Hierarchical question-image co-attention for visual question answering. In Proceedings of Advances in Neural Information Processing Systems, pp. 289–297. Cited by: §II-B, §IV-C2, §IV-C2, TABLE IV, TABLE V.
-  (2017) Exploring models and data for remote sensing image caption generation. IEEE Transactions on Geoscience and Remote Sensing PP, pp. 1–13. Cited by: §I.
-  (2016) Learning to answer questions from image using convolutional neural network. In Proceedings of AAAI Conference on Artificial Intelligence, pp. 3567–3573. Cited by: §II-B.
-  (2014) A multi-world approach to question answering about real-world scenes based on uncertain input. In Proceedings of Advances in Neural Information Processing Systems, pp. 1682–1690. Cited by: §II-B.
-  (2015) Deep captioning with multimodal recurrent neural networks (m-rnn). In Proceedings of International Conference on Learning Representations, Cited by: §I, §II-A, §IV-B2, TABLE I, TABLE II.
-  (2017) Dual attention networks for multimodal reasoning and matching. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2156–2164. Cited by: §II-B, TABLE V.
-  (2016) Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 30–38. Cited by: §II-B, TABLE IV, TABLE V.
-  (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on Association for Computational Linguistics, pp. 311–318. Cited by: §IV-B1.
-  (2016) Deep semantic understanding of high resolution remote sensing image. In Proceedings of International Conference on Computer, Information and Telecommunication Systems, pp. 1–5. Cited by: §I.
-  (2015) Exploring models and data for image question answering. In Proceedings of Advances in Neural Information Processing Systems, pp. 2953–2961. Cited by: §I, §II-B, §III-E, Fig. 6, §IV-A, §IV-C3, TABLE IV.
-  (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §I.
-  (2016) Where to look: focus regions for visual question answering. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 4613–4621. Cited by: §I, §II-B.
-  (2014) Joint video and text parsing for understanding events and answering queries. IEEE MultiMedia 21 (2), pp. 42–70. Cited by: §II-B.
-  (2015) CIDEr: consensus-based image description evaluation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575. Cited by: §IV-B1.
-  (2015) Show and tell: A neural image caption generator. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. Cited by: §I, §II-A, §III-D, §IV-B2, TABLE I, TABLE II.
-  (2016) What value do explicit high level concepts have in vision to language problems?. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 203–212. Cited by: §I-A, §I-A, §II-A, §II-B, TABLE II, TABLE IV, TABLE V.
-  (2018) Image captioning and visual question answering based on attributes and external knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence PP, pp. 853–899. Cited by: §I-A, §II-A, §IV-B2, TABLE I.
-  (1994) Verbs semantics and lexical selection. In Proceedings on Association for Computational Linguistics, pp. 133–138. Cited by: §IV-C1.
-  (2016) Dynamic memory networks for visual and textual question answering. In Proceedings of International Conference on Machine Learning, pp. 2397–2406. Cited by: §I, §II-B, §III-B, §III-B.
-  (2017) Attend to you: personalized image captioning with context sequence memory networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 895–903. Cited by: §I.
-  (2016) Ask, attend and answer: exploring question-guided spatial attention for visual question answering. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VII, pp. 451–466. Cited by: §II-B, TABLE V.
-  (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of International Conference on Machine Learning, pp. 2048–2057. Cited by: §II-A, §III-D2, §IV-B2, TABLE I, TABLE II.
-  (2016) Stacked attention networks for image question answering. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29. Cited by: §II-B, TABLE IV, TABLE V.
-  (2016) Image captioning with semantic attention. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 4651–4659. Cited by: §II-A, §IV-B2, TABLE I, TABLE II.
-  (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, pp. 67–78. Cited by: §IV-A.
-  (2017) Multi-level attention networks for visual question answering. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 4187–4195. Cited by: §I-A, §I-A, §II-B, §III-A, §IV-C3, TABLE V.
-  (2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the International Conference on Machine Learning, Cited by: §IV-A.
-  (2017) Task-driven dynamic fusion: reducing ambiguity in video description. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3713–3721. Cited by: §I.
-  (2017) Hierarchical recurrent neural network for video summarization. In Proceedings of the ACM on Multimedia Conference, pp. 863–871. Cited by: §I.
Xuelong Li is a Full Professor with the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an 710072, P.R. China.
Aihong Yuan is currently pursuing the Ph.D. degree with the Key Laboratory of Spectral Imaging Technology CAS, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, Shaanxi, P. R. China. His research interests include image/video content understanding and deep learning.
Xiaoqiang Lu is a Full Professor with the Key Laboratory of Spectral Imaging Technology CAS, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, Shaanxi, P. R. China.