Abstract
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data. Recent studies try to mitigate this disconnection via end-to-end training. To make it computationally feasible, prior works tend to "imagify" video inputs, i.e., a handful of sparsely sampled frames are fed into a 2D CNN, followed by a simple mean-pooling or concatenation to obtain the overall video representations. Although achieving promising results, such simple approaches may lose temporal information that is essential for performing downstream VidL tasks. In this work, we present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs. Further, unlike previous studies that found pre-training tasks on video inputs (e.g., masked frame modeling) not very effective, we design a new pre-training task, Masked Visual-token Modeling (MVM), for better video modeling. Specifically, the original video frame patches are "tokenized" into discrete visual tokens, and the goal is to recover the original visual tokens based on the masked patches. Comprehensive analysis demonstrates the effectiveness of both explicit temporal modeling via video transformer and MVM. As a result, VIOLET achieves new state-of-the-art performance on 5 video question answering tasks and 4 text-to-video retrieval tasks.
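
The MVM objective described above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the implementation used in this work: the nearest-neighbor codebook lookup in `tokenize_patches` stands in for whatever visual tokenizer is actually used, the 0.5 mask ratio is arbitrary, and the all-zero logits are a placeholder for the output of a video-language model that sees only the masked patches. All function names are hypothetical.

```python
import math
import random


def tokenize_patches(patches, codebook):
    """Map each patch vector to the index of its nearest codebook entry.

    Assumption: a nearest-neighbor lookup stands in for a learned tokenizer.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(p, codebook[k]))
        for p in patches
    ]


def choose_mask(num_patches, mask_ratio, rng):
    """Pick which patch positions are hidden from the model."""
    num_masked = max(1, round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def mvm_loss(logits, target_tokens, masked_positions):
    """Mean cross-entropy over the masked positions only."""
    total = 0.0
    for i in masked_positions:
        row = logits[i]
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target_tokens[i]]
    return total / len(masked_positions)


# Toy data: 4 two-dimensional "patches" and a 3-entry codebook (both made up).
rng = random.Random(0)
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patches = [[0.1, -0.1], [0.9, 1.2], [0.1, 0.8], [1.1, 0.9]]

tokens = tokenize_patches(patches, codebook)       # discrete visual tokens
masked = choose_mask(len(patches), 0.5, rng)       # positions to recover

# Placeholder for model predictions: uniform logits over the codebook.
uniform_logits = [[0.0] * len(codebook) for _ in patches]
loss = mvm_loss(uniform_logits, tokens, masked)
```

With uniform logits the loss equals the log of the codebook size, which is the value an untrained predictor would start from. Training would lower it only at the masked positions, since unmasked patches contribute nothing to the sum.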