Abstract
We propose UniViLM, a Unified Video and Language pre-training Model for multimodal understanding and generation. Motivated by the recent success of BERT-based pre-training techniques for NLP and image-language tasks, VideoBERT and CBT were proposed to exploit the BERT model for video and language pre-training using narrated instructional videos. Unlike these works, which pre-train only for understanding tasks, we propose a unified video-language pre-training model for both understanding and generation tasks. Our model comprises four components with a Transformer backbone: two single-modal encoders, a cross encoder, and a decoder. We first pre-train our model to learn a universal representation for both video and language on a large instructional video dataset. We then fine-tune the model on two multimodal tasks: an understanding task (text-based video retrieval) and a generation task (multimodal video captioning). Our extensive experiments show that our method improves performance on both understanding and generation tasks and achieves state-of-the-art results.