CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields

Abstract

We present CLIP-NeRF, a multi-modal 3D object manipulation method for neural radiance fields (NeRF). By leveraging the joint language-image embedding space of the recent Contrastive Language-Image Pre-Training (CLIP) model, we propose a unified framework that allows manipulating NeRF in a user-friendly way, using either a short text prompt or an exemplar image. Specifically, to combine the novel view synthesis capability of NeRF and the controllable manipulation ability of latent representations from generative models, we introduce a disentangled conditional NeRF architecture that allows individual control over both shape and appearance. This is achieved by performing the shape conditioning via applying a learned deformation field to the positional encoding and deferring color conditioning to the volumetric rendering stage. To bridge this disentangled latent representation to the CLIP embedding, we design two code mappers that take a CLIP embedding as input and update the latent codes to reflect the targeted editing. The mappers are trained with a CLIP-based matching loss to ensure the manipulation accuracy. Furthermore, we propose an inverse optimization method that accurately projects an input image to the latent codes for manipulation to enable editing on real images. We evaluate our approach by extensive experiments on a variety of text prompts and exemplar images and also provide an intuitive interface for interactive editing. Our implementation is available at https://cassiepython.github.io/clipnerf/
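
As an illustration of two ideas named in the abstract, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a standard sinusoidal NeRF positional encoding, stands in a hypothetical `toy_deformation` function for the learned deformation network, and assumes the CLIP-based matching loss takes the form of a cosine distance between embeddings. All function names and the tanh-bounded displacement are assumptions made for illustration only.

```python
import math


def positional_encoding(x, num_freqs):
    """Sinusoidal encoding: sin and cos of 2^k * pi * x_i for each frequency k."""
    out = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        for xi in x:
            out.append(math.sin(freq * xi))
            out.append(math.cos(freq * xi))
    return out


def toy_deformation(encoding, shape_code):
    """Hypothetical stand-in for a learned deformation network.

    Returns one bounded displacement per encoding entry, conditioned on the
    shape code. A real model would use a trained MLP here.
    """
    s = sum(shape_code) / len(shape_code)
    return [math.tanh(s * e) for e in encoding]


def shape_conditioned_encoding(x, shape_code, num_freqs=4):
    """Shape conditioning: add the deformation to the positional encoding."""
    enc = positional_encoding(x, num_freqs)
    delta = toy_deformation(enc, shape_code)
    return [e + d for e, d in zip(enc, delta)]


def clip_matching_loss(edited_embedding, target_embedding):
    """Assumed form of a matching loss: 1 - cosine similarity of two embeddings."""
    dot = sum(a * b for a, b in zip(edited_embedding, target_embedding))
    norm_a = math.sqrt(sum(a * a for a in edited_embedding))
    norm_b = math.sqrt(sum(b * b for b in target_embedding))
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    point = [0.1, -0.3, 0.5]
    plain = positional_encoding(point, 4)
    edited = shape_conditioned_encoding(point, [0.2, 0.4])
    print(len(plain), len(edited))
    print(clip_matching_loss([1.0, 0.0], [0.0, 1.0]))
```

In this sketch the appearance code is deliberately absent from the encoding step: per the abstract, color conditioning is deferred to the rendering stage, so only the shape code touches the positional encoding. In a real pipeline the embeddings passed to the loss would come from CLIP's image and text encoders rather than hand-written vectors.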
