A Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling

Abstract

Given the features of a video, a recurrent neural network can be used to automatically generate a caption for the video. Existing methods for video captioning have at least three limitations. First, semantic information has been widely applied to boost the performance of video captioning models, but existing networks often fail to provide meaningful semantic features. Second, the Teacher Forcing algorithm is often used to optimize video captioning models, but different strategies guide word generation during training and inference, which leads to poor performance. Third, current video captioning models are prone to generating relatively short captions, which describe video content inadequately. To resolve these three problems, we make three corresponding improvements. First, we use both static spatial features and dynamic spatio-temporal features as input to a semantic detection network (SDN) in order to generate meaningful semantic features for videos. Second, we propose a scheduled sampling strategy that gradually shifts training from a teacher-guided manner towards a more self-teaching manner. Finally, the ordinary log-probability loss function is weighted by sentence length so that the bias towards short sentences is alleviated. Our model achieves state-of-the-art results on the Youtube2Text dataset and is competitive with state-of-the-art models on the MSR-VTT dataset.
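
To make the second and third improvements concrete, the following is a minimal sketch rather than the paper's implementation. The abstract does not specify the sampling schedule or the exact form of the length weighting, so the linear decay, the exponent `beta`, and all function names here are assumptions chosen for illustration.

```python
import random


def teacher_forcing_prob(epoch, num_epochs, floor=0.0):
    """Probability of feeding the ground-truth word at a given epoch.

    Assumption: a linear decay from 1.0 (pure teacher forcing) down to
    `floor`; the abstract only says the shift is gradual.
    """
    if num_epochs <= 1:
        return floor
    frac = min(max(epoch / (num_epochs - 1), 0.0), 1.0)
    return 1.0 - (1.0 - floor) * frac


def choose_next_input(gold_token, predicted_token, p_teacher, rng=random):
    """Scheduled sampling step: feed the gold word with probability
    p_teacher, otherwise feed the model's own previous prediction."""
    return gold_token if rng.random() < p_teacher else predicted_token


def length_weighted_nll(token_log_probs, beta=0.7):
    """Negative log-likelihood of one caption divided by length**beta.

    Assumption: this normalization (and the default beta) is one plausible
    reading of a loss "weighted by sentence length". With beta > 0, long
    captions are penalized less than under a plain summed loss.
    """
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("caption must contain at least one token")
    return -sum(token_log_probs) / (n ** beta)
```

Early in training `teacher_forcing_prob` stays close to 1, so the decoder mostly sees ground-truth words, as in Teacher Forcing. Later it mostly sees its own predictions, which is closer to the conditions it faces at inference time.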
