High-Fidelity Pluralistic Image Completion with Transformers

Abstract

Image completion has made tremendous progress with convolutional neural networks (CNNs) because of their powerful texture modeling capacity. However, due to some inherent properties (e.g., local inductive prior, spatial-invariant kernels), CNNs do not perform well in understanding global structures or naturally supporting pluralistic completion. Recently, transformers have demonstrated their power in modeling long-range relationships and generating diverse results, but their computational complexity is quadratic in the input length, which hampers their application to high-resolution images. This paper brings the best of both worlds to pluralistic image completion: appearance prior reconstruction with a transformer and texture replenishment with a CNN. The transformer recovers pluralistic coherent structures together with some coarse textures, while the CNN enhances the local texture details of the coarse priors, guided by the high-resolution masked images. The proposed method vastly outperforms state-of-the-art methods in three aspects: 1) a large performance boost in image fidelity, even compared to deterministic completion methods; 2) better diversity and higher fidelity for pluralistic completion; 3) exceptional generalization ability on large masks and generic datasets, like ImageNet.
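
To make the quadratic-cost claim concrete, the following minimal sketch counts the pairwise attention scores a single self-attention layer would compute if every pixel were treated as one token. The function name and the resolutions are hypothetical, chosen only for illustration; they are not taken from the paper, and the sketch does not reproduce the authors' architecture.

```python
def attention_matrix_entries(height: int, width: int) -> int:
    """Count pairwise attention scores when each pixel is one token.

    Assumption (not from the paper): one token per pixel, one attention
    head, full (dense) self-attention.
    """
    tokens = height * width
    return tokens * tokens  # every token attends to every token


# Illustrative resolutions only; the abstract does not state specific sizes.
for side in (32, 64, 256):
    entries = attention_matrix_entries(side, side)
    print(f"{side}x{side}: {side * side} tokens, {entries} attention scores")
```

Under these assumptions, going from a 32x32 to a 256x256 input multiplies the token count by 64 and the number of attention scores by 4096, which suggests why a transformer is more practical on a low-resolution prior, with a CNN left to restore detail at full resolution.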
