Abstract
Image inpainting refers to erasing unwanted pixels from an image and filling them in a semantically consistent and realistic way. Traditionally, the pixels to be erased are specified with binary masks. From an application point of view, a user must generate masks for the objects they would like to remove, which can be time-consuming and error-prone. In this work, we are interested in an image inpainting algorithm that estimates which object to remove based on natural language input and simultaneously removes it. For this purpose, we first construct a dataset for this task, named GQA-Inpaint, which will be released soon. Second, we present a novel inpainting framework, Inst-Inpaint, that removes objects from images based on instructions given as text prompts. We set up various GAN and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare the methods using different evaluation metrics that measure the quality and accuracy of the models, and show significant quantitative and qualitative improvements.