### Abstract

Given the features of a video, a recurrent neural network can be used to automatically generate a caption for the video. Existing methods for video captioning have at least three limitations. First, semantic information has been widely applied to boost the performance of video captioning models, but existing networks often fail to provide meaningful semantic features. Second, the Teacher Forcing algorithm is often used to optimize video captioning models, but different strategies guide word generation during training and inference, which leads to poor performance. Third, current video captioning models tend to generate relatively short captions, which describe video content inadequately. To address these three problems, we make three corresponding improvements. First, we use both static spatial features and dynamic spatio-temporal features as input to a semantic detection network (SDN) in order to generate meaningful semantic features for videos. Second, we propose a scheduled sampling strategy that gradually shifts training from a teacher-guided manner towards a more self-teaching manner. Finally, the ordinary log-probability loss function is modulated by sentence length so that the bias towards short sentences is alleviated. Our model achieves state-of-the-art results on the Youtube2Text dataset and is competitive with the state-of-the-art models on the MSR-VTT dataset.


## 2 Keywords:

video captioning, scheduled sampling, sentence-length-modulated loss, semantic assistance, RNN

A Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling

Haoran Chen ${}^{1}$, Ke Lin ${}^{2}$, Alexander Maye ${}^{3}$, Jianming Li ${}^{1}$ and Xiaolin Hu ${}^{1,*}$

## 3 Introduction

Video captioning aims to automatically generate a concise and accurate description for a video. It requires techniques from both computer vision (CV) and natural language processing (NLP). Deep learning (DL) methods for sequence-to-sequence learning are able to learn the mapping from discrete color arrays to dense vectors, which are then used to generate natural language sequences without human intervention. These methods have produced impressive results on this task compared with methods based on manually crafted features.

It has been increasingly recognized in video captioning that the semantic meaning of a video is critical and beneficial for an RNN to generate annotations (Pan et al., 2016; Gan et al., 2017). Keeping semantic consistency between video content and video description also helps to refine the semantic richness of a generated sentence (Gao et al., 2017). However, few researchers have explored methods to obtain video semantic features, metrics to measure their quality, or the correlation between video captioning performance and the meaningfulness of semantic features.

Several training strategies have been used to optimize video captioning models, such as the Teacher Forcing algorithm and CIDEnt-RL. The Teacher Forcing algorithm is a simple and intuitive way to train an RNN, but it suffers from the discrepancy between training, which uses the ground truth to guide word generation at each step, and inference, which samples from the model itself at each step. Reinforcement learning (RL) techniques have also been adopted to improve the training process of video captioning. CIDEnt-RL is one of the best RL algorithms (Pasunuru and Bansal, 2017b), but calculating metrics for every batch is extremely time-consuming. In addition, the improvement across metrics is unbalanced. In other words, the improvements on other metrics are not as large as that on the specific metric optimized directly.

The commonly used loss function for video captioning is composed of the log probabilities of the correct target words (Venugopalan et al., 2015; Donahue et al., 2015). A long sentence tends to incur a high loss, because each additional word reduces the joint probability by roughly at least one order of magnitude. In contrast, a short sentence with few words has a relatively low loss. Thus, a video captioning model is prone to generating short sentences after being optimized with a log-likelihood loss function. Excessively short annotations may neither describe a video accurately nor express its content in rich language.

We propose to improve video captioning in three aspects. First, we build our semantic detection network (SDN) on top of two streams: a 2D ConvNet, which is meant to capture the static visual features of a video, and a 3D ConvNet, which is intended to extract the dynamic visual information. Consequently, the SDN is able to produce more meaningful semantic features for a video. Second, we train our video captioning model with scheduled sampling, which searches for extreme points in the RNN state space more extensively and bridges the gap between training and inference (Bengio et al., 2015). Third, we optimize our model with a sentence-length-modulated loss function, which encourages the model to generate longer captions with more detail.

Our implementation, available on GitHub, is based on the TensorFlow deep learning framework.

## 4 Related Works

### 4.1 Image Captioning

The encoder-decoder paradigm has been widely applied in image captioning since it was introduced for machine translation (Cho et al., 2014). It has become a mainstream method in both image captioning and machine translation (Vinyals et al., 2015; Mao et al., 2014). Inspired by successful attempts to employ attention in machine translation (Bahdanau et al., 2015) and object detection (Ba et al., 2015), models that are able to attend to key elements in an image have been investigated for the purpose of generating high-quality image annotations. Semantic features (You et al., 2016) and object features (Anderson et al., 2018) are incorporated into the attention mechanism as heuristic information to guide selective and dynamic attention to salient segments in images. Reinforcement learning techniques, which optimize specific metrics of a model directly, have also been adopted to enhance the performance of image captioning models (Rennie et al., 2017). Graph Convolutional Networks (GCN) have been introduced to cooperate with RNNs to integrate both semantic and spatial information into the image encoder in order to generate high-quality image representations (Yao et al., 2018).

### 4.2 Video Captioning

Though both image captioning and video captioning are multi-modal tasks, video captioning is probably the harder of the two, because a video contains not only spatial features but also temporal correlations.

Following the successful adoption of the encoder-decoder paradigm in image captioning, multi-modal features of a video are fed into a sequence-to-sequence model to generate a video description, with the assistance of models pretrained on image classification (Venugopalan et al., 2015; Donahue et al., 2015). In order to alleviate the semantic inconsistency between the video content and the generated caption, visual features and semantic features of a video are mapped to a common embedding space, so that semantic consistency can be achieved by minimizing the Euclidean distance between the two embedded features (Pan et al., 2016).

An RNN, especially an LSTM, can be extended by integrating high-level tags or attributes of a video with its visual features through embedding and element-wise addition/multiplication operations (Gan et al., 2017). Yu et al. (2016) exploit a sentence generator built upon an RNN module to model language, a multi-modal layer to integrate information from different modalities and an attention module to dynamically select salient features from the input. The output of the sentence generator is fed into a paragraph generator to describe a relatively long video with several sentences.

Following the attention mechanism introduced by Xu et al. (2015), Gao et al. (2017) capture the salient structure of a video with the help of its visual features and the context information provided by an LSTM. Though bottom-up (Anderson et al., 2018) and top-down attention (Ramanishka et al., 2017) were proposed for image captioning, selectively focusing on salient regions in an image is, to some extent, similar to picking key frames in a video (Chen et al., 2018). Wang et al. (2018a) explore cross-modal attention at different granularities and capture both global and local temporal structures in multi-modal features to assist the generation of video captions.

Because labeled video data are scarce and unlabeled video data are abundant, Pasunuru and Bansal (2017a) and Sun et al. (2019) propose to improve video captioning with self-supervised or unsupervised learning tasks, such as unsupervised video prediction, entailment generation and text-to-video generation. Pasunuru and Bansal (2017a) demonstrate that multi-task training helps to share knowledge across different domains, and that each task, including video captioning, benefits from training on the other, unrelated tasks. Sun et al. (2019) take advantage of the abundance of unlabeled videos on YouTube and train the BERT model introduced in (Devlin et al., 2018) on comparably large-scale video data; the trained model is then used as a feature extractor for video captioning. By composing different experts on different known activities, Wang et al. (2018b) take advantage of an external textual corpus and implicitly transfer known knowledge to unseen data for zero-shot video captioning.

### 4.3 RNN Training Strategy

The traditional method to train an RNN is the Teacher Forcing algorithm (Williams and Zipser, 1989), which feeds human annotations to the RNN as input at each step to guide token generation during training, and samples a token from the model itself as input during inference. The different sources of input tokens during training and inference make the model unable to generate high-quality tokens at inference time, as errors may accumulate along the generated sequence.

Bengio et al. (2015) propose to switch gradually from guiding generation with true tokens to feeding sampled tokens during training, which helps the RNN adapt to the inference scheme in advance. This approach has been applied to image captioning and speech recognition. Inspired by Huszar (2015), who proves mathematically that both the Teacher Forcing algorithm and Curriculum Learning tend to learn a biased model, Goyal et al. (2016) address the problem by adopting an adversarial domain method to align the dynamics of the RNN in training and inference.

Inspired by the successful application of RL methods in image captioning (Rennie et al., 2017), Pasunuru and Bansal (2017b) propose a modified reward, which compensates for the logical contradiction in phrase-matching metrics, as the direct optimization target in video captioning. The gradient of the non-differentiable RL loss function is computed and back-propagated with the REINFORCE algorithm (Williams, 1992). However, calculating the reward for each training batch adds a non-negligible computational cost to the training process and slows down optimization. In addition, the improvements of the RL method on other metrics are not comparable with the improvement on the specific metric used as the RL reward.

## 5 The Proposed Approaches

We consider video captioning as a supervised task. The training set is annotated as $N$ pairs of $\{{\mathbf{X}}_{i},{\mathbf{Y}}_{i}\}$, where ${\mathbf{X}}_{i}$ denotes a video and ${\mathbf{Y}}_{i}$ represents the corresponding target caption. Suppose there are $M$ frames in a video and a caption consists of ${L}_{i}$ words; then we have (1).

$$\begin{aligned}{\mathbf{X}}_{i}&=\{{\mathbf{x}}_{i,0},{\mathbf{x}}_{i,1},\mathrm{\dots},{\mathbf{x}}_{i,M-1}\},\\ {\mathbf{Y}}_{i}&=\{{\mathbf{y}}_{i,0},{\mathbf{y}}_{i,1},\mathrm{\dots},{\mathbf{y}}_{i,{L}_{i}-1}\},\end{aligned}$$ | (1) |

where each $\mathbf{x}$ denotes a single frame and each $\mathbf{y}$ denotes a word belonging to a fixed known dictionary.

A pretrained model is used to produce word embeddings, which gives a low-dimensional embedding of a caption ${\mathbf{Y}}_{i}\in {\mathbb{R}}^{{L}_{i}\times {D}_{w}}$,

$${\mathbf{Y}}_{i}={({\mathbf{w}}_{i,0},{\mathbf{w}}_{i,1},\mathrm{\dots},{\mathbf{w}}_{i,{L}_{i}-1})}^{T},{\mathbf{w}}_{i,j}\in {\mathbb{R}}^{{D}_{w}},$$ | (2) |

where ${D}_{w}$ is the dimension of word embedding space.

### 5.1 Encoder-Decoder Paradigm

#### 5.1.1 Encoder

Our encoder is composed of a 3D ConvNet, a 2D ConvNet and a semantic detection network (SDN). The 3D ConvNet is used to produce the spatio-temporal feature ${\mathbf{e}}_{i}\in {\mathbb{R}}^{{D}_{e}}$ for the $i$th video. The 2D ConvNet is meant to extract the static visual feature ${\mathbf{r}}_{i}\in {\mathbb{R}}^{{D}_{r}}$ for the $i$th video. Finally, the visual spatio-temporal representation of the $i$th video is obtained by concatenating the two features (3).

$${\mathbf{v}}_{i}=\left(\begin{array}{c}\hfill {\mathbf{r}}_{i}\hfill \\ \hfill {\mathbf{e}}_{i}\hfill \end{array}\right)\in {\mathbb{R}}^{{D}_{v}},$$ | (3) |

where ${D}_{v}={D}_{e}+{D}_{r}$.

For semantic detection, we manually select the $K$ most common words from both the training set and the validation set as candidate tags for all the videos. The semantic detection task is treated as a multi-label classification task with ${\mathbf{v}}_{i}$ as the representation of the $i$th video and ${\mathbf{y}}_{i}=\{{y}_{i,0},{y}_{i,1},\mathrm{\dots},{y}_{i,K-1}\}\in {\{0,1\}}^{K}$ as the ground truth. If the $j$th tag appears in the annotations of the $i$th video, then ${y}_{i,j}=1$; otherwise, ${y}_{i,j}=0$. Let ${\mathbf{s}}_{i}$ be the semantic feature of the $i$th video. Then we have ${\mathbf{s}}_{i}=\sigma (f({\mathbf{v}}_{i}))\in {(0,1)}^{K}$, where $f(\cdot )$ is a nonlinear mapping and $\sigma (\cdot )$ is the sigmoid activation function. A relatively deep multi-layer perceptron (MLP) on top of the two-stream framework is used to model the nonlinear mapping. The SDN is trained by minimizing the loss function (4).

$$L({\mathbf{s}}_{i},{\mathbf{y}}_{i})=-\frac{1}{N}\sum _{j=0}^{K-1}\left[{y}_{i,j}\mathrm{log}{s}_{i,j}+(1-{y}_{i,j})\mathrm{log}(1-{s}_{i,j})\right]$$ | (4) |

A probability distribution of tags ${\mathbf{s}}_{i}$ is produced by SDN to represent the semantic content of the $i$th video in the training set, the validation set or the test set.
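The multi-label objective described above can be illustrated with a minimal, standard-library Python sketch. It computes the binary cross-entropy of one video over its $K$ candidate tags, negated so that lower is better. The function names are hypothetical, the logits stand in for the MLP output $f({\mathbf{v}}_{i})$, and averaging over the $N$ training videos is assumed to be done by the caller.

```python
import math


def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sdn_loss(logits, tags, eps=1e-12):
    """Binary cross-entropy of one video over its K candidate tags.

    logits: K pre-sigmoid outputs (stand-in for f(v_i)).
    tags:   K ground-truth labels y_{i,j} in {0, 1}.
    """
    total = 0.0
    for z, y in zip(logits, tags):
        s = sigmoid(z)                     # s_{i,j} in (0, 1)
        s = min(max(s, eps), 1.0 - eps)    # clamp so that log stays finite
        total += y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return -total                          # negated log-likelihood


# Predictions that agree with the tags give a lower loss than ones that disagree.
good = sdn_loss([4.0, -4.0, 3.0], [1, 0, 1])
bad = sdn_loss([-4.0, 4.0, -3.0], [1, 0, 1])
```

With a single tag and a logit of zero the loss equals $\mathrm{log}2$, the loss of an uninformed prediction.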

#### 5.1.2 Decoder

A standard RNN (Elman, 1990) is capable of learning temporal patterns along input sequences, but it suffers from the vanishing/exploding gradient problem, which makes it unable to generalize to long sequences. LSTM (Hochreiter and Schmidhuber, 1997) is a prevailing variant of RNN that alleviates the long-term dependency problem by using gates to update the cell state, but it ignores the semantic information of the input sequence. We use SCN (Gan et al., 2017), a variant of LSTM, as our decoder because it not only avoids the long-term dependency problem but also takes advantage of the semantic information of the input video. Suppose we have the video feature $\mathbf{v}$, the semantic feature $\mathbf{s}$, the input vector ${\mathbf{x}}_{t}$ at time step $t$ and the hidden state ${\mathbf{h}}_{t-1}$ at time step $t-1$. SCN integrates the semantic information $\mathbf{s}$ into $\mathbf{v}$, ${\mathbf{x}}_{t}$ and ${\mathbf{h}}_{t-1}$ respectively, and obtains the semantics-related video feature $\widehat{\mathbf{v}}$, the semantics-related input ${\widehat{\mathbf{x}}}_{t}$ and the semantics-related hidden state ${\widehat{\mathbf{h}}}_{t-1}$ (5).

$$\begin{aligned}{\widehat{\mathbf{x}}}_{z,t}&={\mathbf{W}}_{z,c}\cdot (({\mathbf{W}}_{z,a}\cdot {\mathbf{x}}_{t})\odot ({\mathbf{W}}_{z,b}\cdot \mathbf{s})),\quad z\in \{c,i,f,o\},\\ {\widehat{\mathbf{v}}}_{z}&={\mathbf{C}}_{z,c}\cdot (({\mathbf{C}}_{z,a}\cdot \mathbf{v})\odot ({\mathbf{C}}_{z,b}\cdot \mathbf{s})),\quad z\in \{c,i,f,o\},\\ {\widehat{\mathbf{h}}}_{z,t-1}&={\mathbf{U}}_{z,c}\cdot (({\mathbf{U}}_{z,a}\cdot {\mathbf{h}}_{t-1})\odot ({\mathbf{U}}_{z,b}\cdot \mathbf{s})),\quad z\in \{c,i,f,o\},\end{aligned}$$ | (5) |

where $c$, $i$, $f$ and $o$ denote cell state, input gate, forget gate and output gate respectively.

Then the input gate ${\mathbf{i}}_{t}$, the forget gate ${\mathbf{f}}_{t}$ and the output gate ${\mathbf{o}}_{t}$ at time step $t$ are calculated in a way similar to the standard LSTM (6).

$$\begin{aligned}{\mathbf{i}}_{t}&=\sigma ({\widehat{\mathbf{x}}}_{i,t}+{\widehat{\mathbf{h}}}_{i,t-1}+{\widehat{\mathbf{v}}}_{i}+{\mathbf{b}}_{i}),\\ {\mathbf{f}}_{t}&=\sigma ({\widehat{\mathbf{x}}}_{f,t}+{\widehat{\mathbf{h}}}_{f,t-1}+{\widehat{\mathbf{v}}}_{f}+{\mathbf{b}}_{f}),\\ {\mathbf{o}}_{t}&=\sigma ({\widehat{\mathbf{x}}}_{o,t}+{\widehat{\mathbf{h}}}_{o,t-1}+{\widehat{\mathbf{v}}}_{o}+{\mathbf{b}}_{o}),\end{aligned}$$ | (6) |

where $\sigma $ denotes the logistic sigmoid function $\sigma (x)=\frac{1}{1+{e}^{-x}}\in (0,1)$ and $\mathbf{b}$ is a bias term for each gate.

The raw cell state at the current step $t$ can be computed as (7).

$${\widehat{\mathbf{c}}}_{t}=\mathrm{tanh}({\widehat{\mathbf{x}}}_{c,t}+{\widehat{\mathbf{h}}}_{c,t-1}+{\widehat{\mathbf{v}}}_{c}+{\mathbf{b}}_{c}),$$ | (7) |

where $\mathrm{tanh}$ denotes the hyperbolic tangent function $\mathrm{tanh}(x)=\frac{{e}^{x}-{e}^{-x}}{{e}^{x}+{e}^{-x}}\in (-1,1)$ and ${\mathbf{b}}_{c}$ is the bias term for the cell state. The input gate ${\mathbf{i}}_{t}$ controls the throughput of the semantics-related input ${\widehat{\mathbf{x}}}_{t}$, and the forget gate ${\mathbf{f}}_{t}$ determines how much of the previous cell state ${\mathbf{c}}_{t-1}$ is preserved. Thus, we have the final cell state ${\mathbf{c}}_{t}$ at time step $t$ (8).

$${\mathbf{c}}_{t}={\mathbf{f}}_{t}*{\mathbf{c}}_{t-1}+{\mathbf{i}}_{t}*{\widehat{\mathbf{c}}}_{t}.$$ | (8) |

The output gate is then used to control the throughput ratio of the cell state ${\mathbf{c}}_{t}$, so that the cell output ${\mathbf{h}}_{t}$ is determined by (9).

$${\mathbf{h}}_{t}={\mathbf{o}}_{t}*\mathrm{tanh}({\mathbf{c}}_{t}).$$ | (9) |

The semantics-related variables ${\widehat{\mathbf{x}}}_{t}$, $\widehat{\mathbf{v}}$, ${\widehat{\mathbf{h}}}_{t-1}$ and ${\widehat{\mathbf{c}}}_{t}$ depend on the semantic feature $\mathbf{s}$, so SCN takes the semantic information of the video into account implicitly. The forget gate ${\mathbf{f}}_{t}$ is a key component in updating ${\mathbf{c}}_{t-1}$ to ${\mathbf{c}}_{t}$, which, to some degree, avoids the long-term dependency problem. Our SCN is slightly different from the one in (Gan et al., 2017). In our model, $\mathrm{tanh}(\cdot )$ is used to activate the raw cell input, which confines it within $(-1,1)$, whereas $\sigma (\cdot )$ is used in (Gan et al., 2017), which leads to a range of $(0,1)$. In addition, we add a semantics-related video feature term to each recurrent step, which is absent from (Gan et al., 2017).
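To make the recurrence in (5) to (9) concrete, the following standard-library Python sketch runs a single SCN step on small random weights. It is only an illustration under stated assumptions: the helper names, the factor dimension `d_f`, the weight initialisation and the toy sizes are hypothetical and are not taken from the released implementation.

```python
import math
import random


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def hadamard(a, b):
    return [p * q for p, q in zip(a, b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_params(d_x, d_v, d_s, d_h, d_f, rng):
    """One (a, b, c) factor triple per input type and per z in {c, i, f, o}."""
    params = {}
    for z in "cifo":
        for name, d_in in (("W", d_x), ("C", d_v), ("U", d_h)):
            params[name, z] = (rand_matrix(d_f, d_in, rng),   # *_{z,a}
                               rand_matrix(d_f, d_s, rng),    # *_{z,b}
                               rand_matrix(d_h, d_f, rng))    # *_{z,c}
        params["b", z] = [0.0] * d_h
    return params


def fuse(factors, inp, s):
    """Eq. (5): M_c . ((M_a . inp) * (M_b . s)), with * the element-wise product."""
    a, b, c = factors
    return matvec(c, hadamard(matvec(a, inp), matvec(b, s)))


def scn_step(params, x_t, v, s, h_prev, c_prev):
    pre = {}
    for z in "cifo":
        x_hat = fuse(params["W", z], x_t, s)
        v_hat = fuse(params["C", z], v, s)
        h_hat = fuse(params["U", z], h_prev, s)
        pre[z] = [p + q + r + b for p, q, r, b
                  in zip(x_hat, h_hat, v_hat, params["b", z])]
    i_t = [sigmoid(p) for p in pre["i"]]            # Eq. (6)
    f_t = [sigmoid(p) for p in pre["f"]]
    o_t = [sigmoid(p) for p in pre["o"]]
    c_raw = [math.tanh(p) for p in pre["c"]]        # Eq. (7)
    c_t = [f * cp + i * cr                          # Eq. (8)
           for f, cp, i, cr in zip(f_t, c_prev, i_t, c_raw)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]   # Eq. (9)
    return h_t, c_t


rng = random.Random(0)
params = init_params(d_x=4, d_v=6, d_s=3, d_h=5, d_f=8, rng=rng)
h, c = scn_step(params, [0.1] * 4, [0.2] * 6, [0.9, 0.1, 0.5],
                [0.0] * 5, [0.0] * 5)
```

Because the semantic feature enters every product in `fuse`, changing $\mathbf{s}$ changes all four pre-activations, which is how the decoder conditions on the detected tags.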

### 5.2 Training Method

In the context of an RNN trained with the Teacher Forcing algorithm, the log probability $P({Y}_{i}|{X}_{i};\mathrm{\Theta})$ of a given input/output/label triplet $({X}_{i},{Y}_{i},{\widehat{Y}}_{i})$ under model parameters $\mathrm{\Theta}$ can be calculated as (10).

$$P({Y}_{i}|{X}_{i};\mathrm{\Theta})=\sum _{t=0}^{{L}_{i}-1}\mathrm{log}P({y}_{i,t}|{\widehat{y}}_{i,0},\mathrm{\cdots},{\widehat{y}}_{i,t-1},{X}_{i};\mathrm{\Theta}),$$ | (10) |

where ${L}_{i}$ is the length of output.

In the case of SCN, the joint log probability can be computed as follows:

$$\begin{aligned}P({Y}_{i}|{X}_{i};\mathrm{\Theta})&=\sum _{t=0}^{{L}_{i}-1}\mathrm{log}P({y}_{i,t}|{\widehat{y}}_{i,0},\mathrm{\cdots},{\widehat{y}}_{i,t-1},{s}_{i},{X}_{i};\mathrm{\Theta})\\ &=\sum _{t=0}^{{L}_{i}-1}\mathrm{log}P({y}_{i,t}|{h}_{i,t-1},{c}_{i,t-1},{\widehat{y}}_{i,t-1},{s}_{i},{X}_{i};\mathrm{\Theta}),\end{aligned}$$ | (11) |

where ${h}_{i,t}$, ${c}_{i,t}$, ${s}_{i}$ are the output state, the cell state and the semantic feature of the $i$th video respectively.

To some extent, ${h}_{i,t}$ and ${c}_{i,t}$ can be viewed as the aggregation of all the previous information. We can compute them with the recurrence relation (12).

$$\begin{aligned}{h}_{i,t}&=\begin{cases}f({X}_{i},{h}_{i,t-1},{c}_{i,t-1},{s}_{i},{X}_{i};\mathrm{\Theta})&\text{if } t=0,\\ f({\widehat{y}}_{i,t-1},{h}_{i,t-1},{c}_{i,t-1},{s}_{i},{X}_{i};\mathrm{\Theta})&\text{if } t>0,\end{cases}\\ {c}_{i,t}&=\begin{cases}g({X}_{i},{h}_{i,t-1},{c}_{i,t-1},{s}_{i},{X}_{i};\mathrm{\Theta})&\text{if } t=0,\\ g({\widehat{y}}_{i,t-1},{h}_{i,t-1},{c}_{i,t-1},{s}_{i},{X}_{i};\mathrm{\Theta})&\text{if } t>0,\end{cases}\end{aligned}$$ | (12) |

where ${h}_{i,-1}=\mathbf{0}$ and ${c}_{i,-1}=\mathbf{0}$. During inference, the ground-truth word ${\widehat{y}}_{i,t}$ is unavailable and has to be replaced with the word ${y}_{i,t}$ generated by the model itself, which may lead to the accumulation of prediction errors.

In order to bridge the gap between training and testing in the Teacher Forcing algorithm, we propose to train our video captioning model with scheduled sampling. Scheduled sampling gradually shifts the training process from using the ground-truth words ${\widehat{Y}}_{i}$ as guidance to using the sampled words ${Y}_{i}$ as guidance at each recurrent step. The commonly used strategy to select a word from the output distribution is $\arg \max$. However, because it always selects the word with the largest probability, its search is limited to a relatively small part of the search space. To enlarge the search scope, we instead draw a word at random from the output distribution as part of the input for the next recurrent step. In this way, words with higher probabilities are still more likely to be chosen, while the randomness of the sampling procedure enables the recurrent network to explore a relatively large part of the network state space, so the network is less likely to get stuck in an inferior local minimum. From the perspective of training a machine learning model, the multinomial sampling strategy reduces overfitting of the network; in other words, it acts like a regularizer.

Our method for optimizing the language model consists of two parts: an outer loop that schedules the sampling probability at each recurrent step, and an algorithm inside the RNN that specifies how to sample from the output of the model, with a given probability, as part of the input for the next RNN step.
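The inner sampling procedure can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' algorithm listing: `step_fn` is a hypothetical stand-in for one decoder step that returns the output distribution over the vocabulary, `start` is a hypothetical placeholder for the input of the first step, and words are represented by integer indices.

```python
import random


def sample_multinomial(probs, rng):
    """Draw a word index at random in proportion to the output distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def next_input_token(ground_truth_token, probs, sample_prob, rng):
    """With probability sample_prob feed a sampled word, otherwise the true word."""
    if rng.random() < sample_prob:
        return sample_multinomial(probs, rng)
    return ground_truth_token


def inputs_under_scheduled_sampling(ground_truth, step_fn, sample_prob, rng,
                                    start=0):
    """Return the token fed to the decoder at every step of one caption."""
    fed = [start]                          # input at step 0 (assumed placeholder)
    for t in range(1, len(ground_truth)):
        probs = step_fn(fed[-1])           # distribution predicted at step t-1
        fed.append(next_input_token(ground_truth[t - 1], probs,
                                    sample_prob, rng))
    return fed
```

With `sample_prob = 0` the procedure reduces to Teacher Forcing, and with `sample_prob = 1` every input after the first step is drawn from the model itself, as in inference except that words are sampled rather than chosen by $\arg \max$.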

### 5.3 Sentence-length-related Loss Function

What is a good description for a video? A good description should be both accurate and concise. In order to achieve this goal, we design a sentence-length-modulated loss function (13) for our model.

$$\mathrm{\mathbf{L}\mathbf{o}\mathbf{s}\mathbf{s}}({\widehat{y}}_{i},{s}_{i},{X}_{i};\mathrm{\Theta})=-\sum _{i=0}^{bs-1}\frac{1}{{L}_{i}^{\beta}}\sum _{t=0}^{{L}_{i}-1}\mathrm{log}p({\widehat{y}}_{i,t}|{h}_{i,t-1},{c}_{i,t-1},{s}_{i},{X}_{i};\mathrm{\Theta}),$$ | (13) |

where $bs$ is the batch size and $\beta \ge 0$ is a hyperparameter used to balance the conciseness and accuracy of generated captions. If $\beta =0$, (13) reduces to (14), the loss function commonly used in video captioning. Under this loss function, a long sentence has a greater loss than a short sentence. Thus, after minimizing the loss, the RNN is inclined to generate relatively short annotations, which may be incomplete in semantics or sentence structure. If $\beta =1$, all words in generated captions are treated equally in the loss function as well as in the process of optimization, which may lead to redundant or duplicate words in generated captions.

$$\mathrm{\mathbf{L}\mathbf{o}\mathbf{s}\mathbf{s}}({\widehat{y}}_{i},{s}_{i},{X}_{i};\mathrm{\Theta})=-\sum _{i=0}^{bs-1}\sum _{t=0}^{{L}_{i}-1}\mathrm{log}p({\widehat{y}}_{i,t}|{h}_{i,t-1},{c}_{i,t-1},{s}_{i},{X}_{i};\mathrm{\Theta}).$$ | (14) |

Thus, we have the following optimization problem:

$$\mathrm{\Theta}=\mathrm{arg}\underset{\mathrm{\Theta}}{\mathrm{min}}-\sum _{i=0}^{N-1}\frac{1}{{L}_{i}^{\beta}}\sum _{t=0}^{{L}_{i}-1}\mathrm{log}p({\widehat{y}}_{i,t}|{h}_{i,t-1},{c}_{i,t-1},{s}_{i},{X}_{i};\mathrm{\Theta}),$$ | (15) |

where $N$ is the size of training data and $\mathrm{\Theta}$ is the parameter of our model.
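The effect of $\beta$ in (13) can be checked numerically with a minimal Python sketch. The function name and the toy probabilities are hypothetical: each inner list holds the probabilities that a model assigns to the ground-truth words of one caption.

```python
import math


def length_modulated_loss(batch_word_probs, beta):
    """Eq. (13): negative log-likelihood of each caption divided by L_i ** beta."""
    loss = 0.0
    for word_probs in batch_word_probs:
        length = len(word_probs)                        # L_i
        log_likelihood = sum(math.log(p) for p in word_probs)
        loss -= log_likelihood / (length ** beta)
    return loss


# Two toy captions whose words are all predicted with probability 0.5.
short_caption = [[0.5] * 4]
long_caption = [[0.5] * 12]

plain_short = length_modulated_loss(short_caption, beta=0.0)   # 4 * log 2
plain_long = length_modulated_loss(long_caption, beta=0.0)     # 12 * log 2
norm_short = length_modulated_loss(short_caption, beta=1.0)    # log 2
norm_long = length_modulated_loss(long_caption, beta=1.0)      # log 2
```

With $\beta =0$ the longer caption is penalised three times as much even though every word is predicted equally well; with $\beta =1$ the two losses coincide. Intermediate values of $\beta$ interpolate between these two behaviours.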

The overall structure of our model is visualized in Figure 1. In practice, the SDN and the visual feature extractors in the encoder share the same 2D ConvNet and 3D ConvNet.

## 6 Experiments

We evaluate our model on two popular video captioning datasets and compare our results with those of existing methods.

### 6.1 Datasets

#### 6.1.1 MSVD

YouTube2Text or MSVD (Guadarrama et al., 2013; Chen and Dolan, 2011), published in 2013, contains 1970 short YouTube video clips with an average length of about 10 seconds. There are roughly 40 descriptions for each video. We follow the dataset split used in prior work (Pan et al., 2016; Yu et al., 2016; Gan et al., 2017), in which the training set contains 1200 clips, the validation set contains 100 clips and the remaining clips form the test set. We tokenize the captions from the training and validation sets and obtain around 14000 unique words. 12592 of them are used for prediction and the rest are represented by a single unknown-word token. We also add a special symbol to signal the end of a sentence.

#### 6.1.2 MSR-VTT

MSR-Video to Text (MSR-VTT) (Xu et al., 2016; Pan et al., 2016) is a large-scale video benchmark first presented in 2016. In its first version, MSR-VTT provides 10k short video segments with 200k descriptions in total. Each video segment is described by about 20 independent English sentences. In its second version, published in 2017, MSR-VTT provides an additional 3k short clips as the test set, while the video clips of the first version are used as the training and validation sets. Because human annotations are not available for the test set of the second version, we perform experiments on the first version. After tokenization we obtain 14071 unique words that appear more than once in the training and validation sets of MSR-VTT 1.0. 13794 of them are indexed with integers starting at 0 and the rest are represented by a single unknown-word token. A special symbol that signifies the end of a sentence is also added to the vocabulary of MSR-VTT.

### 6.2 Overall Score

Based on the widely used BLEU, METEOR, ROUGE-L and CIDEr metrics, we propose an overall score (16) to evaluate the performance of a language model.

$${\mathbf{S}}_{overall}=\frac{1}{4}\left(\frac{\text{B-4}}{top1(\text{B-4})}+\frac{\text{C}}{top1(\text{C})}+\frac{\text{M}}{top1(\text{M})}+\frac{\text{R}}{top1(\text{R})}\right)\in [0,1],$$ | (16) |

where B-4 denotes BLEU-4, C denotes CIDEr, M denotes METEOR, R denotes ROUGE-L and $top1(\cdot )$ denotes the best value achieved by any model on the given metric. We presume that BLEU-4, CIDEr, METEOR and ROUGE-L each reflect one particular aspect of the performance of a model. We first normalize each metric value of a model by the best value and then take the mean of the normalized values as an overall measurement for that model (16). If the results of a model on all metrics are close to the best results of all models, the overall score is close to 1. The overall score is 1 if and only if a model has state-of-the-art performance on all metrics. If a model is far below the state-of-the-art result on every metric, its overall score is close to 0.
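As a small illustration of the normalize-then-average procedure described above, the following Python sketch computes the overall score from a dictionary of metric values. The function name and the numbers in the usage example are made up for illustration and are not results from the tables.

```python
def overall_score(scores, best):
    """Mean of the four metric values, each normalized by the best known value."""
    metrics = ("B-4", "C", "M", "R")
    return sum(scores[m] / best[m] for m in metrics) / len(metrics)


# Hypothetical metric values, used only to exercise the function.
best = {"B-4": 50.0, "C": 80.0, "M": 35.0, "R": 70.0}
model = {"B-4": 45.0, "C": 80.0, "M": 28.0, "R": 63.0}

score = overall_score(model, best)
```

A model that matches the best value on every metric obtains a score of exactly 1, and any shortfall on a single metric lowers the score in proportion to the relative gap.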

### 6.3 Training Details

Our visual feature consists of two parts: a static visual feature and a dynamic visual feature. ResNeXt (Xie et al., 2017), pretrained on the ImageNet ILSVRC2012 dataset, is used as the static visual feature extractor in the encoder of our model. ECO (Zolfaghari et al., 2018), pretrained on the Kinetics-400 dataset, is used as the dynamic visual feature extractor. More specifically, we take the 2048-dimensional average pooling feature vector of the conv5/block3 output of ResNeXt as the 2D representation of videos and the 1536-dimensional global pooling feature of ECO as the 3D representation of videos. We set the initial learning rate to $2\times {10}^{-4}$ for MSVD and $4\times {10}^{-4}$ for MSR-VTT. In addition, we decay the learning rate by a factor of 0.316 every 20350 steps for MSR-VTT. The batch size is set to 64 and the Adam algorithm is used to optimize the model for both datasets. For Adam, the hyperparameter ${\beta}_{1}$ is set to 0.9, ${\beta}_{2}$ is set to 0.999 and $\epsilon$ is set to $1\times {10}^{-8}$. Each model is trained for 50 epochs, in which the sample probability hyperparameter $\epsilon$ is set to $ep//10\times 0.1$ for the $ep$th epoch. We tune the hyperparameters of our model on the validation sets and select the best checkpoint for testing according to the overall score on the validation set.
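The two schedules mentioned in this paragraph can be written down as a short Python sketch. Two points are assumptions rather than statements from the text: epochs are counted from 0, and the learning rate is multiplied by 0.316 in a staircase fashion every 20350 steps. The function names are hypothetical.

```python
def sample_probability(epoch):
    """Sample probability ep // 10 * 0.1, assuming zero-based epochs."""
    return (epoch // 10) * 0.1


def learning_rate(step, initial_lr=4e-4, decay=0.316, decay_steps=20350):
    """Staircase decay (assumed): multiply by `decay` every `decay_steps` steps."""
    return initial_lr * decay ** (step // decay_steps)


# Sample probability at the first epoch of each block of ten epochs.
schedule = [sample_probability(ep) for ep in range(0, 50, 10)]
```

Under these assumptions the first ten epochs use pure Teacher Forcing, and the probability of feeding a sampled word grows by 0.1 every ten epochs up to about 0.4 in the last ten epochs.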

### 6.4 Comparison with the State-of-the-Art Models

We evaluate our method on Youtube2Text/MSVD (Guadarrama et al., 2013) and MSR-VTT (Xu et al., 2016). We report the results of our model along with those of many existing models in Table 1 and Table 2.

Table 1 displays the performance of several models on MSVD. LSTM-E (Pan et al., 2016) uses VGG and C3D as visual feature extractors, and a joint embedding component bridges the gap between visual information and sentence content. h-RNN is composed of a sentence generator and a paragraph generator. The sentence generator of h-RNN exploits a temporal-spatial attention mechanism to focus on key segments during generation. The paragraph generator of h-RNN captures the dependency between the outputs of the sentence generator at different time steps and provides the sentence generator with a new initial state. aLSTMs integrates LSTM with an attention mechanism to capture the salient elements in a video. Moreover, aLSTMs projects the visual feature and the generated sentence feature into a common space and keeps the semantics consistent by minimizing the Euclidean distance between the two embedded features. SCN uses a semantics-related variant of LSTM as the decoder and C3D and ResNet as the encoder. MTVC shares the same model across the video captioning task, the video prediction task and the entailment generation task, so that the performance on each task benefits from the other two tasks. MTVC also uses an attention mechanism and ensemble learning. SibNet uses an autoencoder for visual information and a visual-semantic joint embedding for semantic information as its encoder, and its decoder generates captions for videos with soft attention. As Table 1 shows, our method outperforms all the other methods on all the metrics by a large margin. Compared with the previous best results, BLEU-4, CIDEr, METEOR and ROUGE-L are improved by 13.4%, 11.5%, 5.0% and 5.5% respectively. Our model has the highest overall score (16).

Table 2 displays the evaluation results of several video captioning models on MSR-VTT. v2t_navigator, Aalto and VideoLAB are the top three models in the MSR-VTT 2017 challenge. MTVC and SibNet are similar to the ones trained on MSVD. CIDEnt-RL optimizes the model with an entailment-enhanced reward (CIDEnt) through reinforcement learning. The CIDEr of our method is only 0.3 lower than that of CIDEnt-RL, which directly optimizes CIDEr by RL, and our method is better than CIDEnt-RL on the other metrics by at least 1.6%. HACA exploits a hierarchically aligned cross-modal attention framework to fuse multi-modal features both spatially and temporally. Our model outperforms HACA on all metrics except METEOR, where it is lower by 2%. TAMoE takes advantage of an external corpus and composes several experts based on external knowledge to generate captions for videos. Our model achieves state-of-the-art results on BLEU-4 and ROUGE-L and has the best result by the weighted average of the four metrics (overall score (6.2)).
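The definition of the overall score (6.2) is not repeated in this section. As a hedged illustration, the sketch below assumes that each metric is divided by the best value in its column and that the four normalized values are then averaged with equal weight. This assumption is inferred from the table values rather than taken from the definition, but it reproduces the Overall column of Table 2 to four decimal places. The function name `overall_scores` is hypothetical.

```python
# Rows copied from Table 2 (MSR-VTT): (BLEU-4, CIDEr, METEOR, ROUGE-L).
table2 = {
    "v2t_navigator": (40.8, 44.8, 28.2, 60.9),
    "Aalto": (39.8, 45.7, 26.9, 59.8),
    "VideoLAB": (39.1, 44.1, 27.7, 60.6),
    "MTVC": (40.8, 47.1, 28.8, 60.2),
    "CIDEnt-RL": (40.5, 51.7, 28.4, 61.4),
    "SibNet": (40.9, 47.5, 27.5, 60.2),
    "HACA": (43.4, 49.7, 29.5, 61.8),
    "TAMoE": (42.2, 48.9, 29.4, 62.0),
    "Our model": (43.8, 51.4, 28.9, 62.4),
}


def overall_scores(table):
    """Assumed overall score: mean over metrics of (score / best score in that column)."""
    num_metrics = len(next(iter(table.values())))
    best = [max(row[k] for row in table.values()) for k in range(num_metrics)]
    return {
        name: sum(value / b for value, b in zip(row, best)) / num_metrics
        for name, row in table.items()
    }


scores = overall_scores(table2)
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Under this assumption a model reaches an overall score of 1.0000 only if it is the best on every metric, which is consistent with the value reported for our model in Table 1.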

Our model achieves new state-of-the-art results on both the MSVD dataset and the MSR-VTT dataset, which demonstrates the superiority of our method. Note that our model is trained on a single dataset only, without an attention mechanism, and that it is tested without ensembling or beam search.

## 7 Model Analysis

In this section, we discuss the utility of the three improvements to our model.

Semantic features are the output of a multi-label classification task. Mean average precision (mAP) is often used to evaluate multi-label classification results (Tsoumakas and Katakis, 2007), so we use it to evaluate the quality of semantic features. Table 3 and Table 4 list the performance of our model trained by scheduled multinomial sampling with different semantic features on MSVD and MSR-VTT, respectively. They show that a better multi-label classification result leads to a better video captioning model in terms of overall score. Semantic features with higher mAP provide clearer potential attributes of a video for the model. Thus, the model is able to generate better video annotations by considering semantic features, spatio-temporal features and context information together.
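To make the mAP measure concrete, the following minimal sketch computes a macro-averaged mAP for a toy multi-label problem. It assumes that average precision is computed per attribute (column) and then averaged over attributes; the exact averaging used for the reported mAP values is not stated in this section, and the toy labels and scores are invented for illustration.

```python
def average_precision(labels, scores):
    """AP for one attribute: mean of precision@k over the ranks k that hold a positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else None


def mean_average_precision(label_matrix, score_matrix):
    """Macro mAP: average AP over attributes that have at least one positive sample."""
    num_attributes = len(label_matrix[0])
    aps = []
    for j in range(num_attributes):
        ap = average_precision(
            [row[j] for row in label_matrix],
            [row[j] for row in score_matrix],
        )
        if ap is not None:
            aps.append(ap)
    if not aps:
        raise ValueError("no attribute has a positive sample")
    return sum(aps) / len(aps)


# Toy data: 4 videos (rows) and 2 semantic attributes (columns).
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
toy_map = mean_average_precision(toy_labels, toy_scores)
print(round(toy_map, 4))  # 0.9167
```

In the toy data the second attribute is ranked perfectly (AP of 1), while the first has one negative ranked above a positive (AP of 5/6), giving a mAP of 11/12.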

Table 5 and Table 6 compare the Teacher Forcing algorithm, scheduled sampling with the $\mathrm{arg}\mathrm{max}$ strategy and scheduled sampling with the multinomial strategy on MSVD and MSR-VTT, respectively. Teacher Forcing uses human annotations to guide word generation during training and samples from the output word distribution of the model to direct generation during inference. The $\mathrm{arg}\mathrm{max}$ strategy gradually shifts during training from teacher forcing to feeding back the word with the highest probability under the model itself. The multinomial strategy is similar to $\mathrm{arg}\mathrm{max}$ but samples words at random from the distribution of the model at each step. As Table 5 and Table 6 show, scheduled sampling with the multinomial strategy performs better than both the Teacher Forcing method and scheduled sampling with the $\mathrm{arg}\mathrm{max}$ strategy, especially on MSR-VTT. Our method explores the RNN state space more widely and is thus likely to find a lower local minimum during training.
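The difference between the two self-feeding strategies can be illustrated with a small sketch. The helper names `pick_argmax` and `pick_multinomial` and the three-word distribution are hypothetical and only serve to show the selection rule, not our actual implementation.

```python
import random


def pick_argmax(probs):
    """Return the index of the most probable word (deterministic)."""
    return max(range(len(probs)), key=probs.__getitem__)


def pick_multinomial(probs, rng):
    """Draw a word index at random according to the word distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical word distribution over a three-word vocabulary.
word_probs = [0.2, 0.5, 0.3]
rng = random.Random(0)

argmax_choice = pick_argmax(word_probs)
multinomial_choices = {pick_multinomial(word_probs, rng) for _ in range(1000)}
print(argmax_choice)        # always the same index for the same distribution
print(multinomial_choices)  # the set of indices visited by random sampling
```

Because the $\mathrm{arg}\mathrm{max}$ rule always feeds back the same word for a given distribution, it visits a single continuation, whereas multinomial sampling also feeds back less probable words, which is the wider exploration referred to above.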

As shown in Table 7, the average length of human annotations is greater than that of the captions generated by any of the models with $\beta \in \{0,0.7,1\}$ (5.3). However, Figure 2 shows that the captions generated by the $\beta =1$ model tend to be redundant, which degrades the overall quality of the generated sentences. The average caption length of the model with $\beta =0.7$ is greater than that of the model with $\beta =0$ and smaller than that of the model with $\beta =1$. The model with $\beta =0.7$ thus generates relatively long annotations for videos without suffering from redundancy or duplicated words.
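The effect of $\beta$ on caption length can be illustrated with a hedged sketch. The loss (5.3) is not reproduced in this section, so the code assumes one plausible form, in which the summed negative log-probability of a caption of length $L$ is divided by $L^{\beta}$. The function name and the per-word probabilities are invented for illustration.

```python
import math


def length_modulated_loss(word_probs, beta):
    """Assumed form: negative log-likelihood of the caption divided by L ** beta."""
    length = len(word_probs)
    nll = -sum(math.log(p) for p in word_probs)
    return nll / (length ** beta)


# Two hypothetical captions whose words are all predicted with probability 0.5.
short_caption = [0.5] * 4
long_caption = [0.5] * 8

for beta in (0.0, 0.7, 1.0):
    short_loss = length_modulated_loss(short_caption, beta)
    long_loss = length_modulated_loss(long_caption, beta)
    print(f"beta={beta}: short={short_loss:.3f}, long={long_loss:.3f}")
```

Under this assumption, $\beta =0$ gives the ordinary summed loss, in which every additional word adds a penalty, so shorter captions are favored. With $\beta =1$ the loss becomes a per-word average and the length penalty disappears entirely, while an intermediate value such as $\beta =0.7$ keeps a reduced penalty for longer captions.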

## 8 Conclusion

We make three improvements to the video captioning task. First, our SDN extracts high-quality semantic features for videos, which contributes to the success of our semantics-assisted model. Second, we apply a scheduled sampling training strategy. Finally, we propose a sentence-length-modulated loss function that keeps our model balanced between language redundancy and conciseness. Our method achieves results superior to the previous state of the art on the MSVD dataset, and its performance is comparable to that of the state-of-the-art models on the MSR-VTT dataset. In the future, we may obtain further improvements in video captioning by integrating a spatio-temporal attention mechanism with visual-semantic features.

## Conflict of Interest Statement

Author Ke Lin is employed by company Samsung. All other authors declare no competing interests.

## Author Contributions

HC designed and performed the experiments. HC, JL and XH analyzed the experimental results and wrote this article. KL and AM analyzed data and polished the article.

## Funding

This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700904, in part by the National Natural Science Foundation of China under Grant 61621136008, in part by the German Research Council (DFG) under Grant TRR-169, and in part by Samsung under contract No. 20183000089.

## Acknowledgments

The authors thank Han Liu, Hallbjorn Thor Gudmunsson and Jing Wen for valuable and insightful discussions.

## Data Availability Statement

The Youtube2Text dataset analyzed for this study can be found in Collecting Multilingual Parallel Video Descriptions Using Mechanical Turk. The MSR-VTT dataset analyzed for this study can be found in The 1st Video to Language Challenge.

## References

- Bottom-up and top-down attention for image captioning and visual question answering. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 6077–6086.
- Multiple object recognition with visual attention. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 1171–1179.
- Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-2011), Portland, OR.
- Less is more: picking informative frames for video captioning. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII, pp. 367–384.
- Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1724–1734.
- BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
- Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 2625–2634.
- Finding structure in time. Cognitive Science 14 (2), pp. 179–211.
- Semantic compositional networks for visual captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 1141–1150.
- Video captioning with attention-based LSTM and semantic consistency. IEEE Transactions on Multimedia 19 (9), pp. 2045–2055.
- Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 4601–4609.
- YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pp. 2712–2719.
- Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- How (not) to train your generative model: scheduled sampling, likelihood, adversary?. CoRR abs/1511.05101.
- SibNet: sibling convolutional encoder for video captioning. In 2018 ACM Multimedia Conference on Multimedia Conference, MM 2018, Seoul, Republic of Korea, October 22-26, 2018, pp. 1425–1434.
- Explain images with multimodal recurrent neural networks. CoRR abs/1410.1090.
- Jointly modeling embedding and translation to bridge video and language. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 4594–4602.
- Multi-task video captioning with video and entailment generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 1273–1283.
- Reinforced video captioning with entailment rewards. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 979–985.
- Top-down visual saliency guided by captions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 3135–3144.
- Self-critical sequence training for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 1179–1195.
- VideoBERT: A joint model for video and language representation learning. CoRR abs/1904.01766.
- Multi-label classification: an overview. International Journal of Data Warehousing and Mining 3 (3), pp. 1–13.
- Sequence to sequence - video to text. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 4534–4542.
- Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 3156–3164.
- Watch, listen, and describe: globally and locally aligned cross-modal attentions for video captioning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pp. 795–801.
- Learning to compose topic-aware mixture of experts for zero-shot video captioning. CoRR abs/1811.02765.
- A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1 (2), pp. 270–280.
- Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3), pp. 229–256.
- Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 5987–5995.
- MSR-VTT: A large video description dataset for bridging video and language. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 5288–5296.
- Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 2048–2057.
- Exploring visual relationship for image captioning. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, pp. 711–727.
- Image captioning with semantic attention. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 4651–4659.
- Video paragraph captioning using hierarchical recurrent neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 4584–4593.
- ECO: efficient convolutional network for online video understanding. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part II, pp. 713–730.

\begin{algorithm}[H]
\begin{algorithmic}[1]
\REQUIRE $EPOCH$: max epoch number, $STEPS\_PER\_EPOCH$: steps per epoch, $\mathbf{feature}$: necessary features
\STATE $\epsilon list \leftarrow \mathit{generate\_epsilon}()$ \COMMENT{Generate $\epsilon$ for each epoch by a predetermined strategy.}
\STATE $\mathbf{output} \leftarrow \mathbf{0}$
\FOR{$i=0$ \TO $EPOCH$}
\FOR{$j=0$ \TO $STEPS\_PER\_EPOCH$}
\STATE $\mathbf{output}_{i,j} \leftarrow \mathit{function}(\mathbf{feature}_{i,j}, \epsilon list[i])$ \COMMENT{Run RNN}
\STATE optimize the network with an optimizer
\STATE extend $\mathbf{output}$ with $\mathbf{output}_{i,j}$
\ENDFOR
\ENDFOR
\RETURN $\mathbf{output}$
\end{algorithmic}
\end{algorithm}

\begin{algorithm}[H]
\begin{algorithmic}[1]
\REQUIRE $\mathbf{v}_{i}$: video feature, $\mathbf{s}_{i}$: semantic feature, $\mathbf{x}_{i}$: input array, $\epsilon$: sampling probability, $STEP$: max time step
\ENSURE $\mathbf{h}_{i}$: output state, $\mathbf{c}_{i}$: cell state
\STATE $\mathbf{h}_{i,0} \leftarrow \mathbf{0}$
\STATE $\mathbf{c}_{i,0} \leftarrow \mathbf{0}$
\STATE $\mathbf{h}_{i} \leftarrow \mathbf{0}$
\STATE $\mathbf{c}_{i} \leftarrow \mathbf{0}$
\STATE $\mathbf{embed} \leftarrow \mathbf{x}_{i,0}$
\FOR{$t=1$ \TO $STEP$}
\STATE $\mathbf{h}_{i,t}, \mathbf{c}_{i,t} \leftarrow \mathit{recurrent\_step}(\mathbf{h}_{i,t-1}, \mathbf{c}_{i,t-1}, \mathbf{v}_{i}, \mathbf{s}_{i}, \mathbf{embed})$
\STATE extend $\mathbf{h}_{i}$ with $\mathbf{h}_{i,t}$
\STATE extend $\mathbf{c}_{i}$ with $\mathbf{c}_{i,t}$
\STATE $prob \leftarrow \mathit{random}(0,1)$
\IF{$prob < \epsilon$}
\STATE $\mathbf{prob\_dist}_{i,t} \leftarrow \mathit{word\_dist\_map}(\mathbf{h}_{i,t})$ \COMMENT{Map output state to word probability.}
\STATE $\mathbf{word\_index} \leftarrow \mathit{multinomial}(\mathbf{prob\_dist}_{i,t})$ \COMMENT{Sample from the word distribution.}
\STATE $\mathbf{embed} \leftarrow \mathit{lookup\_embed}(\mathbf{word\_index})$ \COMMENT{Use an embedding vector to represent the word.}
\ELSE
\STATE $\mathbf{embed} \leftarrow \mathbf{x}_{i,t}$
\ENDIF
\STATE $t \leftarrow t+1$
\ENDFOR
\RETURN $\mathbf{h}_{i}, \mathbf{c}_{i}$
\end{algorithmic}
\end{algorithm}
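The per-sequence sampling loop of the algorithm above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than our implementation: the condition of the branch is assumed to be $prob<\epsilon$, the video and semantic features are assumed to be captured inside the `recurrent_step` callable, and the loop variable is advanced by the `for` statement alone. All `toy_*` helpers are hypothetical stand-ins for the network components.

```python
import random


def unroll_with_scheduled_sampling(x, epsilon, recurrent_step, word_dist_map,
                                   lookup_embed, h0, c0, rng):
    """Unroll one sequence; x[t] is the ground-truth input embedding at step t."""
    h_prev, c_prev = h0, c0
    embed = x[0]
    hs, cs = [], []
    for t in range(1, len(x)):
        h_t, c_t = recurrent_step(h_prev, c_prev, embed)
        hs.append(h_t)
        cs.append(c_t)
        if rng.random() < epsilon:  # assumed condition: prob < epsilon
            # Self-teaching branch: feed back a word sampled from the model.
            probs = word_dist_map(h_t)
            word_index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
            embed = lookup_embed(word_index)
        else:
            # Teacher-guided branch: feed the ground-truth input.
            embed = x[t]
        h_prev, c_prev = h_t, c_t
    return hs, cs


# Toy stand-ins (hypothetical): "embeddings" are strings, and the hidden
# state simply records every input that the recurrent step has received.
def toy_step(h, c, embed):
    return h + [embed], c


def toy_word_dist(h):
    return [0.1, 0.9]


def toy_lookup(index):
    return f"sampled_{index}"


x = ["<bos>", "a", "man", "is"]
teacher_forced, _ = unroll_with_scheduled_sampling(
    x, 0.0, toy_step, toy_word_dist, toy_lookup, [], None, random.Random(0))
self_fed, _ = unroll_with_scheduled_sampling(
    x, 1.0, toy_step, toy_word_dist, toy_lookup, [], None, random.Random(0))
print(teacher_forced[-1])  # inputs seen with epsilon = 0 (pure teacher forcing)
print(self_fed[-1])        # inputs seen with epsilon = 1 (fully self-fed)
```

With $\epsilon =0$ every input comes from the ground-truth sequence, and with $\epsilon =1$ every input after the first is sampled from the model, so intermediate values of $\epsilon$ interpolate between the two regimes.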

Figure 2: Example captions generated by models with different values of $\beta$, together with the ground truth (GT).

| Example | $\beta =0$ | $\beta =0.7$ | $\beta =1$ | GT |
| --- | --- | --- | --- | --- |
| 1 | a woman is mixing a bowl | a woman is mixing a bowl | a person is mixing a bowl of a bowl | somebody is mixing flour |
| 2 | a man is pouring a egg | a man is pouring eggs into a bowl | a man is adding a bowl of a bowl | a man is pouring coconut juice into a bowl |
| 3 | a man is talking about a boat | a main is talking about the water | a man is talking about the the the the the the the the | some men having fun and talking about the sea |
| 4 | a woman is sitting on a couch | a man and a woman are sitting in a bed | a man is sitting on a bed and a woman is sitting on the bed | a man and woman are lying in bed together |

Table 1: Comparison with existing models on MSVD. B-4, C, M and R denote BLEU-4, CIDEr, METEOR and ROUGE-L; "-" marks a value that is not reported.

| Model | B-4 | C | M | R | Overall (6.2) |
| --- | --- | --- | --- | --- | --- |
| LSTM-E (Pan et al., 2016) | 45.3 | - | 31.0 | - | - |
| h-RNN (Yu et al., 2016) | 49.9 | 65.8 | 32.6 | - | - |
| aLSTMs (Gao et al., 2017) | 50.8 | 74.8 | 33.3 | - | - |
| SCN (Gan et al., 2017) | 51.1 | 77.7 | 33.5 | - | - |
| MTVC (Pasunuru and Bansal, 2017a) | 54.5 | 92.4 | 36.0 | 72.8 | 0.9198 |
| ECO (Zolfaghari et al., 2018) | 53.5 | 85.8 | 35.0 | - | - |
| SibNet (Liu et al., 2018) | 54.2 | 88.2 | 34.8 | 71.7 | 0.8969 |
| Our model | 61.8 | 103.0 | 37.8 | 76.8 | 1.0000 |

Table 2: Comparison with existing models on MSR-VTT.

| Model | B-4 | C | M | R | Overall |
| --- | --- | --- | --- | --- | --- |
| MSR-VTT Challenge 2017 | | | | | |
| Rank1: v2t_navigator | 40.8 | 44.8 | 28.2 | 60.9 | 0.9325 |
| Rank2: Aalto | 39.8 | 45.7 | 26.9 | 59.8 | 0.9157 |
| Rank3: VideoLAB | 39.1 | 44.1 | 27.7 | 60.6 | 0.9140 |
| State-of-the-Art Models | | | | | |
| MTVC (Pasunuru and Bansal, 2017a) | 40.8 | 47.1 | 28.8 | 60.2 | 0.9459 |
| CIDEnt-RL (Pasunuru and Bansal, 2017b) | 40.5 | 51.7 | 28.4 | 61.4 | 0.9678 |
| SibNet (Liu et al., 2018) | 40.9 | 47.5 | 27.5 | 60.2 | 0.9374 |
| HACA (Wang et al., 2018a) | 43.4 | 49.7 | 29.5 | 61.8 | 0.9856 |
| TAMoE (Wang et al., 2018b) | 42.2 | 48.9 | 29.4 | 62.0 | 0.9749 |
| Our model | 43.8 | 51.4 | 28.9 | 62.4 | 0.9935 |

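Table 3: Performance of our model with different semantic features on MSVD.

| Semantic Features (mAP) | B-4 | C | M | R | Overall |
| --- | --- | --- | --- | --- | --- |
| 0.2603 | 55.7 | 93.9 | 35.2 | 74.5 | 0.9286 |
| 0.4039 | 55.3 | 96.6 | 36.6 | 74.2 | 0.9418 |
| 0.4755 | 61.8 | 103.0 | 37.8 | 76.8 | 1.0000 |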

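Table 4: Performance of our model with different semantic features on MSR-VTT.

| Semantic Features (mAP) | B-4 | C | M | R | Overall |
| --- | --- | --- | --- | --- | --- |
| 0.1994 | 41.6 | 46.3 | 27.4 | 60.9 | 0.9437 |
| 0.2188 | 42.2 | 48.9 | 27.9 | 61.9 | 0.9681 |
| 0.2441 | 43.8 | 51.4 | 28.9 | 62.4 | 1.0000 |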

Table 5: Comparison of training methods on MSVD.

| Training Method | B-4 | C | M | R | Overall |
| --- | --- | --- | --- | --- | --- |
| Teacher Forcing | 60.4 | 93.9 | 37.4 | 75.8 | 0.9663 |
| $\mathrm{arg}\mathrm{max}$ | 60.0 | 99.4 | 36.7 | 76.1 | 0.9744 |
| Multinomial | 61.8 | 103.0 | 37.8 | 76.8 | 1.0000 |

Table 6: Comparison of training methods on MSR-VTT.

| Training Method | B-4 | C | M | R | Overall |
| --- | --- | --- | --- | --- | --- |
| Teacher Forcing | 43.1 | 49.0 | 28.4 | 61.9 | 0.9780 |
| $\mathrm{arg}\mathrm{max}$ | 44.0 | 50.1 | 28.5 | 62.4 | 0.9902 |
| Multinomial | 43.8 | 51.4 | 28.9 | 62.4 | 0.9988 |

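Table 7: Average caption length of the models with different values of $\beta$ and of the ground truth.

| Model | $\beta =0$ | $\beta =0.7$ | $\beta =1$ | Ground Truth |
| --- | --- | --- | --- | --- |
| mLen1 | 5.12 | 5.18 | 5.80 | 7.01 |
| mLen2 | 6.27 | 6.69 | 6.99 | 9.32 |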