Language Models Can See: Plugging Visual Controls in Text Generation

Abstract

Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with remarkable quality. While they are designed for text-prompted generation, it remains an open question how the generation process could be guided by modalities beyond text such as images. In this work, we propose a training-free framework, called MAGIC (iMAge-Guided text generatIon with CLIP), for plugging in visual controls in the generation process and enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner. MAGIC is a simple yet efficient plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation. During decoding, MAGIC influences the generation of the LM by introducing a CLIP-induced score, called magic score, which regularizes the generated result to be semantically related to a given image while being coherent to the previously generated context. Notably, the proposed decoding scheme does not involve any gradient update operation, therefore being computationally efficient. On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27 times decoding speedup. MAGIC is a flexible framework and is theoretically compatible with any text generation tasks that incorporate image grounding. In the experiments, we showcase that it is also capable of performing visually grounded story generation given both an image and a text prompt.
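
The decoding idea described above can be illustrated with a minimal sketch of a single decoding step. The abstract does not state the scoring formula, so everything specific here is an assumption made for illustration: the function name `pick_next_token`, the weight `beta`, the softmax over similarities, and the simple weighted sum. In a real system the candidate probabilities would come from the LM (e.g., GPT-2) and the image-text similarities from CLIP; here both are passed in as plain numbers so the example stays self-contained. Note that no gradient update is involved, only a re-scoring of candidates.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(candidates, lm_probs, image_sims, beta=2.0):
    """Choose the next token from a small candidate set (hypothetical helper).

    candidates: top-k tokens proposed by the language model
    lm_probs:   LM probability of each candidate given the text so far
    image_sims: image-text similarity of (context + candidate) to the image,
                assumed to come from an image-text matching model
    beta:       assumed weight of the visual term; beta=0 ignores the image
    """
    if not (len(candidates) == len(lm_probs) == len(image_sims)):
        raise ValueError("all inputs must have the same length")
    # Normalise the similarities across the candidates so that the visual
    # term is on a comparable scale to the LM probabilities.
    visual = softmax(image_sims)
    # Assumed combination: LM confidence plus a weighted visual score.
    scores = [p + beta * v for p, v in zip(lm_probs, visual)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


# Toy numbers (made up): the LM prefers "the", the image favours "dog".
tokens = ["dog", "car", "the"]
token, _ = pick_next_token(tokens, [0.3, 0.2, 0.5], [5.0, 1.0, 1.0])
print(token)
```

With these toy inputs the visual term outweighs the LM preference, while setting `beta=0.0` reduces the step to ordinary greedy selection over the LM probabilities.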
