Abstract
Natural language BERTs are trained on language corpora in a self-supervised manner. Vision-language BERTs, by contrast, need paired data to train, which restricts the scale of VL-BERT pretraining. We propose a self-training approach that allows VL-BERTs to be trained from unlabeled image data. The proposed method starts with our unified conditional model, a vision-language BERT model that can perform zero-shot conditional generation. Given different conditions, the unified conditional model can generate captions, dense captions, and even questions. We use the labeled image data to train a teacher model and use the trained model to generate pseudo captions on unlabeled image data. We then combine the labeled data and pseudo-labeled data to train a student model. The process is iterated by using the student model as the new teacher. With the proposed self-training approach and only 300k extra unlabeled images, we achieve performance competitive with, or even better than, that of similarly sized models trained with 3 million extra images.
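The teacher-student iteration described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the actual training code: `self_train`, `toy_train`, and the image identifiers are hypothetical names, and `toy_train` is a trivial stand-in for VL-BERT training that only memorizes captions so the control flow can be run end to end.

```python
from typing import Callable, List, Tuple

Image = str                      # placeholder for an image identifier (assumption)
Caption = str
Pair = Tuple[Image, Caption]
Model = Callable[[Image], Caption]


def self_train(
    labeled: List[Pair],
    unlabeled: List[Image],
    train: Callable[[List[Pair]], Model],
    rounds: int = 2,
) -> Model:
    """Iterative teacher-student self-training (illustrative sketch only)."""
    # The first teacher sees only the human-labeled pairs.
    teacher = train(labeled)
    for _ in range(rounds):
        # The teacher writes pseudo captions for the unlabeled images.
        pseudo = [(img, teacher(img)) for img in unlabeled]
        # The student is trained on labeled plus pseudo-labeled data.
        student = train(labeled + pseudo)
        # The student becomes the teacher for the next round.
        teacher = student
    return teacher


def toy_train(pairs: List[Pair]) -> Model:
    """Hypothetical stand-in for model training: memorizes the captions it saw."""
    table = dict(pairs)
    n = len(pairs)
    return lambda img: table.get(img, f"caption from model trained on {n} pairs")


labeled_data = [("img_a", "a dog on grass"), ("img_b", "a red bus")]
unlabeled_data = ["img_c", "img_d", "img_e"]
final_model = self_train(labeled_data, unlabeled_data, toy_train, rounds=2)
```

In this toy setting the final model has been trained on the two labeled pairs plus three pseudo-labeled pairs. A real pipeline would replace `toy_train` with actual pretraining and would typically generate pseudo captions with the conditional generation mode of the model.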