Abstract
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip. A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These video clips contain rich content with diverse temporal dynamics, event shifts, and people interactions, collected from two sources: (i) popular TV shows, and (ii) movie clips from YouTube channels. In order to address our new multimodal inference task, a model is required to possess sophisticated reasoning skills, from surface-level grounding (e.g., identifying objects and characters in the video) to in-depth commonsense reasoning (e.g., inferring causal relations of events in the video). We present a detailed analysis of the dataset and an extensive evaluation over many strong baselines, providing valuable insights on the challenges of this new task.