VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

Abstract

We present a new large-scale multilingual video description dataset, VATEX, which contains over 41,250 videos and 825,000 captions in both English and Chinese. Among the captions, there are over 206,000 English-Chinese parallel translation pairs. Compared to the widely-used MSR-VTT dataset, VATEX is multilingual, larger, linguistically complex, and more diverse in terms of both video and natural language descriptions. We also introduce two tasks for video-and-language research based on VATEX: (1) Multilingual Video Captioning, aimed at describing a video in various languages with a compact unified captioning model, and (2) Video-guided Machine Translation, to translate a source language description into the target language using the video information as additional spatiotemporal context. Extensive experiments on the VATEX dataset show that, first, the unified multilingual model can not only produce both English and Chinese descriptions for a video more efficiently, but also offer improved performance over the monolingual models. Furthermore, we demonstrate that the spatiotemporal video context can be effectively utilized to align source and target languages and thus assist machine translation. In the end, we discuss the potentials of using VATEX for other video-and-language research.
