MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data

Abstract

We train a model to generate images from multimodal prompts of interleaved text and images such as "a <picture of a man> man and his <picture of a dog> dog in an <picture of a cartoon> animated style." We bootstrap a multimodal dataset by extracting semantically meaningful image crops corresponding to words in the image captions of synthetically generated and publicly available text-image data. Our model, MUMU, is composed of a vision-language model encoder with a diffusion decoder and is trained on a single 8xH100 GPU node. Despite being only trained on crops from the same image, MUMU learns to compose inputs from different images into a coherent output. For example, an input of a realistic person and a cartoon will output the same person in the cartoon style, and an input of a standing subject and a scooter will output the subject riding the scooter. As a result, our model generalizes to tasks such as style transfer and character consistency. Our results show the promise of using multimodal models as general purpose controllers for image generation.
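
To make the interleaved prompt format concrete, the following is a minimal sketch of how a caption and a set of word-level bounding boxes might be turned into a sequence of text tokens and image-crop placeholders. It is an illustration only, not the procedure used in the paper: the `Crop` type, the `interleave` function, the dictionary of detections, and the box coordinates are all hypothetical, and the step that actually finds a box for a caption word (for example, an object detector) is assumed to have run already.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# Assumed box convention: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Crop:
    """Hypothetical placeholder for an image crop tied to one caption word."""
    word: str
    box: Box


def interleave(caption: str, detections: Dict[str, Box]) -> List[Union[str, Crop]]:
    """Place a Crop placeholder before each caption word that has a box.

    `detections` maps a lowercase caption word to its box; how those boxes
    are obtained is outside the scope of this sketch.
    """
    prompt: List[Union[str, Crop]] = []
    for token in caption.split():
        key = token.strip(".,").lower()
        if key in detections:
            prompt.append(Crop(key, detections[key]))
        prompt.append(token)
    return prompt


# Illustrative values only; the boxes below are made up.
example = interleave(
    "a man and his dog",
    {"man": (10, 20, 110, 220), "dog": (120, 150, 200, 230)},
)
```

Here `example` mirrors the shape of the prompt quoted above: each matched word is preceded by a placeholder for its crop, and unmatched words pass through as plain text.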
