Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models

Abstract

Generative models have recently exhibited exceptional capabilities in various scenarios, for example, image generation based on text descriptions. In this work, we focus on the task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling. We make the following three contributions: (i) to fulfill the task of visual storytelling, we introduce two modules into a pre-trained stable diffusion model, and construct an auto-regressive image generator, termed StoryGen, that generates the current frame by conditioning on both a text prompt and a preceding frame; (ii) to train our proposed model, we collect paired image and text samples from various online sources, such as videos and E-books, and establish a data processing pipeline for constructing a diverse dataset, named StorySalon, with a far larger vocabulary than existing animation-specific datasets; (iii) we adopt a three-stage curriculum training strategy that enables style transfer, visual context conditioning, and human feedback alignment, respectively. Quantitative experiments and human evaluation have validated the superiority of our proposed model in terms of image quality, style consistency, content consistency, and visual-language alignment. We will make the code, model, and dataset publicly available to the research community.
