LangSplat: 3D Language Gaussian Splatting

  • 2024-03-31 05:45:58
  • Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, Hanspeter Pfister
Humans live in a 3D world and commonly use natural language to interact with a 3D scene. Modeling a 3D language field to support open-ended language queries in 3D has gained increasing attention recently. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat represents the language field with a collection of 3D Gaussians, each encoding language features distilled from CLIP. By employing a tile-based splatting technique for rendering language features, we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and then learns language features in the scene-specific latent space, thereby alleviating the substantial memory demands imposed by explicit modeling. Existing methods produce imprecise and vague 3D language fields that fail to discern clear boundaries between objects. We examine this issue and propose to learn hierarchical semantics using SAM, which eliminates both the need to query the language field extensively across various scales and the need for DINO feature regularization. Extensive experimental results show that LangSplat outperforms the previous state-of-the-art method, LERF, by a large margin. Notably, LangSplat is extremely efficient, achieving a 199× speedup over LERF at a resolution of 1440 × 1080. We strongly encourage readers to check out our video results at
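The rendering and querying steps mentioned in the abstract can be pictured with a small sketch. It assumes that language features are rendered by the standard Gaussian-splatting front-to-back alpha blending, where each depth-sorted Gaussian contributes its latent feature weighted by its opacity times the transmittance left by nearer Gaussians. The linear `decode` stand-in for the autoencoder's decoder and the use of plain cosine similarity as a query score are simplifying assumptions; `composite_features`, `decode`, and `cosine_similarity` are hypothetical names, and this is not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = List[float]


def composite_features(features: Sequence[Vector], alphas: Sequence[float]) -> Vector:
    """Front-to-back alpha blending of per-Gaussian latent features for one pixel.

    features[i] and alphas[i] belong to the i-th Gaussian covering the pixel,
    assumed to be already sorted by depth (nearest first).
    """
    if len(features) != len(alphas):
        raise ValueError("features and alphas must have the same length")
    dim = len(features[0]) if features else 0
    out = [0.0] * dim
    transmittance = 1.0  # fraction of the ray not yet absorbed by nearer Gaussians
    for feat, alpha in zip(features, alphas):
        weight = alpha * transmittance
        for k in range(dim):
            out[k] += weight * feat[k]
        transmittance *= 1.0 - alpha
    return out


def decode(latent: Vector, weights: Sequence[Vector]) -> Vector:
    """Hypothetical linear decoder mapping a low-dim latent back to a CLIP-sized vector.

    weights has one row per output dimension; a trained autoencoder would be
    nonlinear, so this only illustrates the shape of the computation.
    """
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity, used here as a simplified relevancy score for a text query."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


if __name__ == "__main__":
    # Two Gaussians with 3-D latent features cover the pixel.
    pixel_latent = composite_features([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.5, 0.5])
    print(pixel_latent)  # [0.5, 0.25, 0.0]
    # Hypothetical 4 x 3 decoder weights and a hypothetical text embedding.
    decoded = decode(pixel_latent, [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]])
    print(cosine_similarity(decoded, [1.0, 0.0, 0.0, 1.0]))
```

Blending in a low-dimensional latent space keeps the per-pixel accumulation cheap, which is consistent with the abstract's point that the scene-wise autoencoder reduces memory compared with storing full CLIP embeddings on every Gaussian.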


