Abstract
Humans describe the physical world using natural language to refer to specific 3D locations based on a vast range of properties: visual appearance, semantics, abstract associations, or actionable affordances. In this work we propose Language Embedded Radiance Fields (LERFs), a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF, which enables these types of open-ended language queries in 3D. LERF learns a dense, multi-scale language field inside NeRF by volume rendering CLIP embeddings along training rays, supervising these embeddings across training views to provide multi-view consistency and smooth the underlying language field. After optimization, LERF can extract 3D relevancy maps for a broad range of language prompts interactively in real time, which has potential use cases in robotics, understanding vision-language models, and interacting with 3D scenes. LERF enables pixel-aligned, zero-shot queries on the distilled 3D CLIP embeddings without relying on region proposals or masks, supporting long-tail open-vocabulary queries hierarchically across the volume. The project website can be found at https://lerf.io.
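
As a rough illustration of what "volume rendering CLIP embeddings along training rays" could look like, the sketch below composites per-sample embeddings along one ray using standard NeRF rendering weights, normalizes the result to unit length, and scores it against a query embedding. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy 3-dimensional vectors standing in for CLIP embeddings, and the use of plain cosine similarity as the relevancy score are all hypothetical choices made for illustration.

```python
import math


def render_weights(densities, deltas):
    """Standard NeRF weights: w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    weights = []
    transmittance = 1.0  # fraction of light that reaches the current sample
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def render_embedding(densities, deltas, embeddings):
    """Weighted sum of per-sample embeddings along a ray, then unit-normalized."""
    weights = render_weights(densities, deltas)
    dim = len(embeddings[0])
    rendered = [0.0] * dim
    for w, emb in zip(weights, embeddings):
        for k in range(dim):
            rendered[k] += w * emb[k]
    norm = math.sqrt(sum(v * v for v in rendered))
    if norm == 0.0:
        return rendered  # empty ray: nothing to normalize
    return [v / norm for v in rendered]


def cosine_similarity(a, b):
    """Assumed stand-in for a relevancy score between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy ray: empty space first, then a dense surface sample.
densities = [0.0, 50.0]
deltas = [0.1, 0.1]
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical 3-D "CLIP" vectors

pixel_embedding = render_embedding(densities, deltas, embeddings)
query = [0.0, 1.0, 0.0]  # hypothetical text-query embedding
score = cosine_similarity(pixel_embedding, query)
```

In this toy ray the first sample has zero density and contributes nothing, so the rendered embedding is dominated by the surface sample and aligns with the query. Supervising such rendered embeddings against image-derived embeddings from many views is what would tie the field together across viewpoints.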