Prefix Language Models are Unified Modal Learners

Abstract

With the success of vision-language pre-training, we have witnessed the state of the art being pushed forward on multi-modal understanding and generation. However, the current pre-training paradigm is either incapable of targeting all modalities at once (e.g., text generation and image generation), or requires multi-fold well-designed tasks, which significantly limits scalability. We demonstrate that a unified modal model can be learned with a prefix language modeling objective on text and image sequences. Thanks to this simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks across modalities (language / vision / vision+language), types (understanding / generation), and settings (e.g., zero-shot, fine-tuning, linear evaluation) with a single unified architecture. DaVinci achieves competitive performance on a wide range of 26 understanding / generation tasks, and outperforms previous unified vision-language models on most tasks, including ImageNet classification (+1.6%), VQAv2 (+1.4%), COCO caption generation (BLEU@4 +1.1%, CIDEr +1.5%), and COCO image generation (IS +0.9%, FID -1.0%), at comparable model and data scale. Furthermore, we offer a well-defined benchmark for future research by reporting performance at different scales of the pre-training dataset, with heterogeneous and wide distribution coverage. Our results establish new, stronger baselines for future comparisons at different data scales and shed light on the difficulties of comparing VLP models more generally.
