UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

  • 2020-02-15 10:03:25
  • Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Xilin Chen, Ming Zhou

Abstract

We propose UniViLM: a Unified Video and Language pre-training Model for multimodal understanding and generation. Motivated by the recent success of BERT-based pre-training techniques for NLP and image-language tasks, VideoBERT and CBT were proposed to apply the BERT model to video and language pre-training using narrated instructional videos. Unlike these works, which pre-train only for understanding tasks, we propose a unified video-language pre-training model for both understanding and generation tasks. Our model comprises four components: two single-modal encoders, a cross encoder, and a decoder, all with a Transformer backbone. We first pre-train our model to learn a universal representation for both video and language on a large instructional video dataset. We then fine-tune the model on two multimodal tasks: an understanding task (text-based video retrieval) and a generation task (multimodal video captioning). Our extensive experiments show that our method improves performance on both understanding and generation tasks and achieves state-of-the-art results.
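To make the four-component layout concrete, the following is a minimal shape-level sketch in Python, not the authors' implementation. It only tracks tensor shapes as data flows from the two single-modal encoders through the cross encoder to the decoder. The class name `UniViLMShapes`, the helper `trace`, the default sizes, and the choice to concatenate the text and video sequences before the cross encoder are all assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class UniViLMShapes:
    """Shape-only stand-in for the four components named in the abstract.

    hidden: shared Transformer width (hypothetical value).
    vocab:  caption vocabulary size (hypothetical value).
    """
    hidden: int = 768
    vocab: int = 30522

    def text_encoder(self, n_tokens: int) -> Shape:
        # Single-modal encoder 1: one hidden vector per text token.
        return (n_tokens, self.hidden)

    def video_encoder(self, n_clips: int) -> Shape:
        # Single-modal encoder 2: one hidden vector per video clip feature.
        return (n_clips, self.hidden)

    def cross_encoder(self, text: Shape, video: Shape) -> Shape:
        # Assumption: both sequences are concatenated along the length axis
        # so the cross encoder can attend across modalities.
        assert text[1] == video[1] == self.hidden
        return (text[0] + video[0], self.hidden)

    def decoder(self, memory: Shape, n_out: int) -> Shape:
        # Decoder: attends to the fused sequence and emits one vocabulary
        # distribution per generated caption token.
        assert memory[1] == self.hidden
        return (n_out, self.vocab)


def trace(model: UniViLMShapes, n_tokens: int, n_clips: int, n_out: int) -> Dict[str, Shape]:
    """Follow one example through all four components and record each shape."""
    text = model.text_encoder(n_tokens)
    video = model.video_encoder(n_clips)
    fused = model.cross_encoder(text, video)
    logits = model.decoder(fused, n_out)
    return {"text": text, "video": video, "cross": fused, "caption_logits": logits}
```

Under these assumptions, a retrieval-style (understanding) task would read from the encoder outputs, while captioning (generation) would additionally use the decoder output; calling `trace(UniViLMShapes(), n_tokens=20, n_clips=48, n_out=15)` lists the shape at each stage.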

 
