Abstract
Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant knowledge fetched by a retriever from external memory (e.g., multimodal documents on the web). Specifically, we implement a retriever using the pretrained CLIP model and a generator using the CM3 Transformer architecture, and train this model using the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate mixtures of text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel capabilities such as knowledge-intensive image generation and multimodal in-context learning.
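The retrieve-then-generate flow described above can be illustrated with a minimal sketch. It is not the RA-CM3 implementation. The three-dimensional vectors are made-up stand-ins for encoder embeddings (the abstract names CLIP as the retriever but gives no details), the function names are hypothetical, and placing the retrieved documents ahead of the query in the generator input is an assumption about how the generator "refers to" them.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_vec: Sequence[float],
             memory: List[Tuple[str, Sequence[float]]],
             k: int = 2) -> List[str]:
    """Return the ids of the k memory entries most similar to the query."""
    ranked = sorted(memory,
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def build_generator_input(retrieved: List[str], query: str) -> List[str]:
    """Assumed layout: retrieved documents first, then the query itself."""
    return retrieved + [query]


# Toy external memory: (document id, made-up embedding).
memory = [
    ("eiffel_photo", [0.9, 0.1, 0.0]),
    ("cat_photo", [0.0, 1.0, 0.1]),
    ("paris_skyline", [0.8, 0.2, 0.1]),
]

query_vec = [1.0, 0.0, 0.0]  # stand-in embedding of the query
top_docs = retrieve(query_vec, memory, k=2)
context = build_generator_input(top_docs, "query: Eiffel Tower at sunset")
print(context)
```

In this toy setup, knowledge lives in `memory` rather than in model weights, so adding or replacing an entry changes what the generator can condition on without any retraining, which is the modularity argument the abstract makes.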