Language Models are General-Purpose Interfaces

Abstract

Foundation models have received much attention due to their effectiveness across a broad range of downstream applications. Though there is a big convergence in terms of architecture, most pretrained models are typically still developed for specific tasks or modalities. In this work, we propose to use language models as a general-purpose interface to various foundation models. A collection of pretrained encoders perceive diverse modalities (such as vision and language), and they dock with a language model that plays the role of a universal task layer. We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders. We subsume the advantages and capabilities from both causal and non-causal modeling, thereby combining the best of both worlds. Specifically, the proposed method not only inherits the capabilities of in-context learning and open-ended generation from causal language modeling, but also is conducive to finetuning because of the bidirectional encoders. More importantly, our approach seamlessly unlocks the combinations of the above capabilities, e.g., enabling in-context learning or instruction following with finetuned encoders. Experimental results across various language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models on finetuning, zero-shot generalization, and few-shot learning.
