360FusionNeRF: Panoramic Neural Radiance Fields with Joint Guidance

Abstract

We present a method to synthesize novel views from a single $360^\circ$ panorama image based on the neural radiance field (NeRF). Prior studies in a similar setting rely on the neighborhood interpolation capability of multi-layer perceptrons to complete missing regions caused by occlusion, which leads to artifacts in their predictions. We propose 360FusionNeRF, a semi-supervised learning framework in which we introduce geometric supervision and semantic consistency to guide the progressive training process. First, the input image is re-projected to $360^\circ$ images, and auxiliary depth maps are extracted at other camera positions. The depth supervision, in addition to the NeRF color guidance, improves the geometry of the synthesized views. Additionally, we introduce a semantic consistency loss that encourages realistic renderings of novel views. We extract these semantic features using a pre-trained visual encoder such as CLIP, a Vision Transformer trained on hundreds of millions of diverse 2D photographs mined from the web with natural language supervision. Experiments indicate that our proposed method can produce plausible completions of unobserved regions while preserving the features of the scene. When trained across various scenes, 360FusionNeRF consistently achieves state-of-the-art performance when transferring to the synthetic Structured3D dataset (PSNR~5%, SSIM~3%, LPIPS~13%), the real-world Matterport3D dataset (PSNR~3%, SSIM~3%, LPIPS~9%), and the Replica360 dataset (PSNR~8%, SSIM~2%, LPIPS~18%).
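
The joint guidance described in the abstract (color, depth, and semantic consistency) can be illustrated with a minimal sketch. The abstract does not give the exact loss forms or weights, so everything here is an assumption made for illustration: the function names `joint_loss` and `cosine_similarity`, the choice of mean squared error for color, mean absolute error for depth, one minus cosine similarity for the semantic term, and the weights `w_depth` and `w_sem`. The feature vectors stand in for embeddings that a pre-trained encoder such as CLIP would produce; no encoder is called.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D feature vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def joint_loss(pred_rgb, gt_rgb, pred_depth, aux_depth,
               feat_novel, feat_ref, w_depth=0.1, w_sem=0.1):
    """Hypothetical combined objective (not the paper's exact formulation).

    pred_rgb, gt_rgb      : rendered and reference colors, same shape
    pred_depth, aux_depth : rendered depth and auxiliary depth map, same shape
    feat_novel, feat_ref  : embeddings of a rendered novel view and the input view
    w_depth, w_sem        : assumed weights for the depth and semantic terms
    """
    pred_rgb = np.asarray(pred_rgb, dtype=np.float64)
    gt_rgb = np.asarray(gt_rgb, dtype=np.float64)
    pred_depth = np.asarray(pred_depth, dtype=np.float64)
    aux_depth = np.asarray(aux_depth, dtype=np.float64)

    # Color guidance: assumed mean squared error.
    color = float(np.mean((pred_rgb - gt_rgb) ** 2))
    # Depth supervision: assumed mean absolute error.
    depth = float(np.mean(np.abs(pred_depth - aux_depth)))
    # Semantic consistency: assumed 1 - cosine similarity of embeddings.
    semantic = 1.0 - cosine_similarity(feat_novel, feat_ref)
    return color + w_depth * depth + w_sem * semantic
```

In this sketch the semantic term compares whole-image embeddings rather than pixels, so it can penalize an implausible novel view even where no ground-truth color or depth is available for the unobserved regions.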
