Abstract
To interact seamlessly with both humans and 3D environments, AI agents must not only perceive the 3D world accurately but also align human language with 3D spatial representations. While prior work has made significant progress by integrating language features into geometrically detailed 3D scene representations using 3D Gaussian Splatting (GS), these approaches rely on computationally intensive offline preprocessing of language features for each input image, limiting adaptability to new environments. In this work, we introduce Online Language Splatting, the first framework to achieve online, near real-time, open-vocabulary language mapping within a 3DGS-SLAM system without requiring pre-generated language features. The key challenge lies in efficiently fusing high-dimensional language features into 3D representations while balancing computation speed, memory usage, rendering quality, and open-vocabulary capability. To this end, we design: (1) a high-resolution CLIP embedding module capable of generating detailed language feature maps in 18 ms per frame, (2) a two-stage online auto-encoder that compresses 768-dimensional CLIP features to 15 dimensions while preserving open-vocabulary capabilities, and (3) a color-language disentangled optimization approach to improve rendering quality. Experimental results show that our online method not only surpasses state-of-the-art offline methods in accuracy but also achieves a more than 40x efficiency boost, demonstrating its potential for dynamic and interactive AI applications.
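
To put the figures above in perspective, the following back-of-the-envelope sketch works out what they imply. Only the 768 and 15 feature dimensions and the 18 ms embedding time come from the abstract. The 640x480 resolution, the 4-byte float storage, and all function names are assumptions chosen purely for illustration, and the frame-rate figure is an upper bound for the embedding module alone, not for the full SLAM pipeline.

```python
# Figures stated in the abstract.
CLIP_DIM = 768             # dimensionality of the CLIP language features
COMPRESSED_DIM = 15        # dimensionality after the two-stage auto-encoder
EMBED_MS_PER_FRAME = 18.0  # time to generate one language feature map

# Assumptions for illustration only (not stated in the abstract).
ASSUMED_WIDTH = 640
ASSUMED_HEIGHT = 480
ASSUMED_BYTES_PER_VALUE = 4  # float32 storage


def compression_ratio(src_dim: int, dst_dim: int) -> float:
    """Factor by which the per-pixel feature vector shrinks."""
    return src_dim / dst_dim


def max_fps(ms_per_frame: float) -> float:
    """Upper bound on frame rate if this stage were the only cost."""
    return 1000.0 / ms_per_frame


def feature_map_bytes(height: int, width: int, dim: int,
                      bytes_per_value: int = ASSUMED_BYTES_PER_VALUE) -> int:
    """Size of a dense per-pixel feature map in bytes."""
    return height * width * dim * bytes_per_value


if __name__ == "__main__":
    ratio = compression_ratio(CLIP_DIM, COMPRESSED_DIM)
    fps = max_fps(EMBED_MS_PER_FRAME)
    full = feature_map_bytes(ASSUMED_HEIGHT, ASSUMED_WIDTH, CLIP_DIM)
    small = feature_map_bytes(ASSUMED_HEIGHT, ASSUMED_WIDTH, COMPRESSED_DIM)
    print(f"compression ratio: {ratio:.1f}x")
    print(f"embedding-only frame rate bound: {fps:.1f} fps")
    print(f"dense feature map, {CLIP_DIM} dims: {full / 1e6:.1f} MB")
    print(f"dense feature map, {COMPRESSED_DIM} dims: {small / 1e6:.1f} MB")
```

Under these assumptions, a dense 768-dimensional feature map occupies roughly 944 MB per frame versus about 18 MB at 15 dimensions, which suggests why compression matters for online fusion.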