Abstract
We train a model to generate images from multimodal prompts of interleaved text and images such as "a <picture of a man> man and his <picture of a dog> dog in an <picture of a cartoon> animated style." We bootstrap a multimodal dataset by extracting semantically meaningful image crops corresponding to words in the image captions of synthetically generated and publicly available text-image data. Our model, MUMU, is composed of a vision-language model encoder with a diffusion decoder and is trained on a single 8xH100 GPU node. Despite being only trained on crops from the same image, MUMU learns to compose inputs from different images into a coherent output. For example, an input of a realistic person and a cartoon will output the same person in the cartoon style, and an input of a standing subject and a scooter will output the subject riding the scooter. As a result, our model generalizes to tasks such as style transfer and character consistency. Our results show the promise of using multimodal models as general purpose controllers for image generation.