Abstract
Recent advances in text-to-image synthesis have led to large pretrained transformers with excellent capabilities to generate visualizations from a given text. However, these models are ill-suited for specialized tasks like story visualization, which requires an agent to produce a sequence of images given a corresponding sequence of captions, forming a narrative. Moreover, we find that the story visualization task fails to accommodate generalization to unseen plots and characters in new narratives. Hence, we first propose the task of story continuation, where the generated visual story is conditioned on a source image, allowing for better generalization to narratives with new characters. Then, we enhance or 'retro-fit' the pretrained text-to-image synthesis models with task-specific modules for (a) sequential image generation and (b) copying relevant elements from an initial frame. Next, we explore full-model finetuning, as well as prompt-based tuning for parameter-efficient adaptation, of the pretrained model. We evaluate our approach, StoryDALL-E, on two existing datasets, PororoSV and FlintstonesSV, and introduce a new dataset, DiDeMoSV, collected from a video-captioning dataset. We also develop a model, StoryGANc, based on Generative Adversarial Networks (GANs) for story continuation, and compare it with the StoryDALL-E model to demonstrate the advantages of our approach. We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image, thereby improving continuity in the generated visual story. Finally, our analysis suggests that pretrained transformers struggle to comprehend narratives containing several characters. Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.