LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Abstract

Unified vision-language frameworks have advanced greatly in recent years, most of which adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence generation. However, existing video-language (VidL) models still require task-specific designs in model architecture and training objectives for each task. In this work, we explore a unified VidL framework, LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks. Such unification leads to a simplified model architecture, where only a lightweight MLM head, instead of a decoder with many more parameters, is needed on top of the multimodal encoder. Surprisingly, experimental results show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval and video captioning. Extensive analyses further demonstrate the advantage of LAVENDER over existing VidL methods in: (i) supporting all downstream tasks with just a single set of parameter values when multi-task finetuned; (ii) few-shot generalization on various downstream tasks; and (iii) enabling zero-shot evaluation on video question answering tasks. Code is available at https://github.com/microsoft/LAVENDER.
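
As a rough illustration of what "MLM as the common interface" could look like on the text side, the sketch below casts three task types into the same masked-token prediction format. It is not taken from the LAVENDER codebase: the `[MASK]` string, the `MLMExample` container, the `as_mlm` helper, and the specific templates (appending a mask to a question, predicting "true"/"false" for video-text matching, predicting the next caption word) are all assumptions made for illustration, and video features are omitted entirely.

```python
from dataclasses import dataclass

MASK = "[MASK]"  # assumed BERT-style mask token; the real vocabulary may differ


@dataclass
class MLMExample:
    text: str    # text that would be fed to the multimodal encoder with the video
    target: str  # token a single MLM head would be asked to predict at the mask


def as_mlm(task: str, text: str, label: str) -> MLMExample:
    """Cast one downstream example into a masked-token prediction problem.

    Hypothetical helper: the templates below are illustrative assumptions.
    """
    if task == "qa":
        # Question answering: the answer word fills a mask appended to the question.
        return MLMExample(f"{text} {MASK}", label)
    if task == "retrieval":
        # Retrieval as video-text matching: the mask is filled with "true" or "false".
        if label not in ("true", "false"):
            raise ValueError("retrieval label must be 'true' or 'false'")
        return MLMExample(f"{text} {MASK}", label)
    if task == "captioning":
        # Captioning: `text` is the prefix generated so far, the mask is the next word.
        return MLMExample(f"{text} {MASK}".strip(), label)
    raise ValueError(f"unknown task: {task}")


examples = [
    as_mlm("qa", "what is the man doing?", "cooking"),
    as_mlm("retrieval", "a man is cooking pasta", "true"),
    as_mlm("captioning", "a man is", "cooking"),
]
for ex in examples:
    print(ex.text, "->", ex.target)
```

Because every example reduces to "predict the token at the mask", one output head can in principle serve all three tasks, which is the simplification the abstract attributes to the unified design.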
