How2: A Large-scale Dataset for Multimodal Language Understanding

Abstract

In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization. By making available data and code for several multimodal natural language tasks, we hope to stimulate more research on these and similar challenges, to obtain a deeper understanding of multimodality in language processing.
