Is Attention All NeRF Needs?

Abstract

We present Generalizable NeRF Transformer (GNT), a pure, unified transformer-based architecture that efficiently reconstructs Neural Radiance Fields (NeRFs) on the fly from source views. Unlike prior works on NeRF that optimize a per-scene implicit representation by inverting a handcrafted rendering equation, GNT achieves generalizable neural scene representation and rendering by encapsulating two transformer-based stages. The first stage of GNT, called the view transformer, leverages multi-view geometry as an inductive bias for attention-based scene representation, and predicts coordinate-aligned features by aggregating information from epipolar lines on the neighboring views. The second stage of GNT, named the ray transformer, renders novel views by ray marching and directly decodes the sequence of sampled point features using the attention mechanism. Our experiments demonstrate that when optimized on a single scene, GNT can successfully reconstruct NeRF without an explicit rendering formula, and even improve the PSNR by ~1.3 dB on complex scenes due to the learnable ray renderer. When trained across various scenes, GNT consistently achieves state-of-the-art performance when transferring to the forward-facing LLFF dataset (LPIPS ~20%, SSIM ~25%) and the synthetic Blender dataset (LPIPS ~20%, SSIM ~4%). In addition, we show that depth and occlusion can be inferred from the learned attention maps, which implies that the pure attention mechanism is capable of learning a physically-grounded rendering process. All these results bring us one step closer to the tantalizing hope of utilizing transformers as the "universal modeling tool" even for graphics. Please refer to our project page for video results: https://vita-group.github.io/GNT/.
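To make the contrast between a handcrafted rendering equation and an attention-based ray renderer concrete, the following is a minimal sketch and not the GNT implementation. It assumes the standard NeRF quadrature for `volume_render`, and a single-head dot-product attention pooling for `attention_render` in which a hypothetical `query` vector stands in for learned parameters. All function names, the choice to blend per-point colors rather than decode features, and the `expected_depth` readout are illustrative assumptions.

```python
import math


def volume_render(densities, colors, deltas):
    """Classical NeRF-style quadrature along one ray.

    weight_i = T_i * (1 - exp(-sigma_i * delta_i)), where T_i is the
    transmittance accumulated before sample i. The weights are fixed by
    this formula; nothing here is learned.
    """
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        w = transmittance * alpha
        weights.append(w)
        pixel = [p + w * c for p, c in zip(pixel, rgb)]
        transmittance *= 1.0 - alpha
    return pixel, weights


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_render(features, colors, query):
    """Attention pooling along one ray (illustrative assumption).

    The per-sample weights come from scaled dot products between a
    hypothetical learned `query` and each point feature, instead of
    from a fixed transmittance formula.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale
              for feat in features]
    weights = softmax(scores)
    pixel = [sum(w * c[k] for w, c in zip(weights, colors))
             for k in range(3)]
    return pixel, weights


def expected_depth(weights, depths):
    """One plausible depth readout: weight-averaged sample distance."""
    total = sum(weights)
    return sum(w * t for w, t in zip(weights, depths)) / total
```

In both functions the pixel is a weighted sum over samples along the ray; the difference is where the weights come from. Under this simplification, reading depth from attention weights amounts to calling `expected_depth` on the weights returned by `attention_render`.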
