Geometry-Free View Synthesis: Transformers and no 3D Priors

Abstract

Is a geometric model required to synthesize novel views from a single image? Being bound to local convolutions, CNNs need explicit 3D biases to model geometric transformations. In contrast, we demonstrate that a transformer-based model can synthesize entirely novel views without any hand-engineered 3D biases. This is achieved by (i) a global attention mechanism for implicitly learning long-range 3D correspondences between source and target views, and (ii) a probabilistic formulation necessary to capture the ambiguity inherent in predicting novel views from a single image, thereby overcoming the limitations of previous approaches that are restricted to relatively small viewpoint changes. We evaluate various ways to integrate 3D priors into a transformer architecture. However, our experiments show that no such geometric priors are required and that the transformer is capable of implicitly learning 3D relationships between images. Furthermore, this approach outperforms the state of the art in terms of visual quality while covering the full distribution of possible realizations. Code is available at https://git.io/JOnwn
