4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models

Abstract

Learning 4D language fields to enable time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. While LangSplat successfully grounds CLIP features into 3D Gaussian representations, achieving precision and efficiency in 3D static scenes, it lacks the ability to handle dynamic 4D fields as CLIP, designed for static image-text tasks, cannot capture temporal dynamics in videos. Real-world environments are inherently dynamic, with object semantics evolving over time. Building a precise 4D language field necessitates obtaining pixel-aligned, object-wise video features, which current vision models struggle to achieve. To address these challenges, we propose 4D LangSplat, which learns 4D language fields to handle time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes efficiently. 4D LangSplat bypasses learning the language field from vision features and instead learns directly from text generated from object-wise video captions via Multimodal Large Language Models (MLLMs). Specifically, we propose a multimodal object-wise video prompting method, consisting of visual and text prompts that guide MLLMs to generate detailed, temporally consistent, high-quality captions for objects throughout a video. These captions are encoded using a Large Language Model into high-quality sentence embeddings, which then serve as pixel-aligned, object-specific feature supervision, facilitating open-vocabulary text queries through shared embedding spaces. Recognizing that objects in 4D scenes exhibit smooth transitions across states, we further propose a status deformable network to model these continuous changes over time effectively. Our results across multiple benchmarks demonstrate that 4D LangSplat attains precise and efficient results for both time-sensitive and time-agnostic open-vocabulary queries.
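
To illustrate what querying through a shared embedding space could look like, the following is a minimal sketch, not the method of the paper. It assumes each pixel of a rendered frame already carries a feature vector in the same space as the embedded text query, and it selects pixels by thresholded cosine similarity. The function names (`cosine`, `query_mask`), the threshold value, and the three-dimensional toy vectors are all hypothetical; real sentence embeddings would be far higher-dimensional.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def query_mask(feature_map, query_embedding, threshold=0.8):
    """Return a boolean mask marking pixels whose feature matches the query.

    feature_map is a 2D grid (list of rows) of per-pixel feature vectors.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [
        [cosine(feature, query_embedding) >= threshold for feature in row]
        for row in feature_map
    ]


# Toy stand-ins for object-wise caption embeddings (hypothetical values).
cup = [0.9, 0.1, 0.0]
table = [0.0, 0.2, 0.9]

# One 2x3 frame of pixel-aligned features at a single timestep.
frame_features = [
    [cup, cup, table],
    [table, table, table],
]

# Toy stand-in for the embedded text query.
query = [1.0, 0.0, 0.0]

mask = query_mask(frame_features, query)
```

A time-sensitive query would repeat the same comparison per frame, so that a pixel region can match the query at some timesteps and not at others.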
