Abstract
Diffusion models have recently demonstrated their effectiveness in generating extremely high-quality images and are now utilized in a wide range of applications, including automatic sketch colorization. Although many methods have been developed for guided sketch colorization, there has been limited exploration of the potential conflicts between image prompts and sketch inputs, which can lead to severe deterioration in the results. Therefore, this paper exhaustively investigates reference-based sketch colorization models that aim to colorize sketch images using reference color images. We specifically investigate two critical aspects of reference-based diffusion models: the "distribution problem", which is a major shortcoming compared to text-based counterparts, and the capability in zero-shot sequential text-based manipulation. We introduce two variations of an image-guided latent diffusion model utilizing different image tokens from the pre-trained CLIP image encoder and propose corresponding manipulation methods to adjust their results sequentially using weighted text inputs. We conduct comprehensive evaluations of our models through qualitative and quantitative experiments as well as a user study.