LayoutBERT: Masked Language Layout Model for Object Insertion

Abstract

Image compositing is one of the most fundamental steps in creative workflows. It involves taking objects or parts of several images to create a new image, called a composite. Currently, this process is done manually by creating accurate masks of the objects to be inserted and carefully blending them with the target scene or images, usually with the help of tools such as Photoshop or GIMP. While there have been several works on automatic selection of objects for creating masks, placing an object within an image with the correct position, scale, and harmony remains a difficult problem with limited exploration. Automatic object insertion in images or designs is difficult because it requires an understanding of the scene geometry and the color harmony between objects. We propose LayoutBERT for the object insertion task. It uses a novel self-supervised masked language model objective and bidirectional multi-head self-attention. It outperforms previous layout-based likelihood models and shows favorable properties in terms of model capacity. We demonstrate the effectiveness of our approach for object insertion in the image compositing setting and in other settings such as documents and design templates. We further demonstrate the usefulness of the learned representations for layout-based retrieval tasks. We provide both qualitative and quantitative evaluations on datasets from diverse domains, including COCO, PubLayNet, and two new datasets which we call Image Layouts and Template Layouts. Image Layouts, which consists of 5.8 million images with layout annotations, is the largest image layout dataset to our knowledge. We also share ablation study results on the effect of dataset size, model size, and class sample size for this task.
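
To make the masked language model objective over layouts more concrete, the following is a minimal illustrative sketch, not the implementation used in this work. It assumes that each object is serialized as a class token followed by four quantized box tokens (x, y, width, height), and that all tokens of one randomly chosen object are masked so a bidirectional model could be trained to recover them from the remaining objects. The token format, the bin count `NUM_BINS`, and the helper names `quantize`, `layout_to_tokens`, and `mask_one_object` are all hypothetical.

```python
import random

MASK = "[MASK]"
NUM_BINS = 8  # assumed quantization resolution, chosen only for illustration


def quantize(value, num_bins=NUM_BINS):
    """Map a normalized coordinate in [0, 1] to an integer bin index."""
    return min(int(value * num_bins), num_bins - 1)


def layout_to_tokens(layout):
    """Flatten [(cls, x, y, w, h), ...] into a single token sequence."""
    tokens = []
    for cls, x, y, w, h in layout:
        tokens.append(cls)
        tokens.extend(
            f"{name}{quantize(v)}" for name, v in zip("xywh", (x, y, w, h))
        )
    return tokens


def mask_one_object(tokens, rng, tokens_per_object=5):
    """Replace every token of one randomly chosen object with [MASK].

    Returns the masked sequence and a dict mapping each masked position
    to its original token (the prediction targets).
    """
    num_objects = len(tokens) // tokens_per_object
    start = rng.randrange(num_objects) * tokens_per_object
    masked = list(tokens)
    targets = {}
    for pos in range(start, start + tokens_per_object):
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


# Hypothetical two-object layout with normalized box coordinates.
layout = [("person", 0.10, 0.20, 0.30, 0.60), ("dog", 0.55, 0.60, 0.25, 0.30)]
tokens = layout_to_tokens(layout)
masked, targets = mask_one_object(tokens, random.Random(0))
```

Under these assumptions, object insertion at inference time would correspond to appending a block of `[MASK]` tokens for the new object and letting the model fill in its class and box from the surrounding layout.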
