Abstract
Pretrained models have achieved great success in both Computer Vision (CV) and Natural Language Processing (NLP). This progress has led to Visual-Language Pretrained Models (VLPMs), which learn joint representations of vision and language by feeding visual and linguistic content into a multi-layer transformer. In this paper, we present an overview of the major advances achieved in VLPMs for producing joint representations of vision and language. As preliminaries, we briefly describe the general task definition and generic architecture of VLPMs. We first discuss the language and vision data encoding methods and then present the mainstream VLPM structures as the core content. We further summarise several essential pretraining and fine-tuning strategies. Finally, we highlight three future directions to provide insightful guidance for both CV and NLP researchers.