Abstract
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. While existing approaches often rely on extensive labeled data or exhibit limitations in handling complex language queries, we propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline. LLM-Grounder utilizes an LLM to decompose complex natural language queries into semantic constituents and employs a visual grounding tool, such as OpenScene or LERF, to identify objects in a 3D scene. The LLM then evaluates the spatial and commonsense relations among the proposed objects to make a final grounding decision. Our method does not require any labeled training data and can generalize to novel 3D scenes and arbitrary text queries. We evaluate LLM-Grounder on the ScanRefer benchmark and demonstrate state-of-the-art zero-shot grounding accuracy. Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries, making LLM-Grounder an effective approach for 3D vision-language tasks in robotics. Videos and interactive demos can be found on the project website https://chat-with-nerf.github.io/.
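As a rough illustration of the decompose, propose, and select flow described above, the following Python sketch walks a toy scene through the three stages. It is not the LLM-Grounder implementation: `decompose_query` is a hypothetical rule-based stand-in for the LLM, `ground` is a hypothetical label-matching stand-in for a tool such as OpenScene or LERF, and the `Candidate` type, the "next to" relation, and the nearest-landmark rule are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str      # object category proposed by the grounding tool
    center: tuple   # (x, y, z) position of the object, in metres
    score: float    # confidence of the proposal


def decompose_query(query):
    """Hypothetical stand-in for the LLM: split a query into target and landmark.

    Only handles the pattern "<target> next to <landmark>".
    """
    target, sep, landmark = query.partition(" next to ")
    return {
        "target": target.strip(),
        "relation": "near" if sep else None,
        "landmark": landmark.strip() or None,
    }


def ground(phrase, scene):
    """Hypothetical stand-in for the grounding tool: exact label match."""
    return [c for c in scene if c.label == phrase]


def select(query, scene):
    """Pick one target candidate, using the landmark when the query has one."""
    plan = decompose_query(query)
    targets = ground(plan["target"], scene)
    if not targets:
        return None
    landmarks = ground(plan["landmark"], scene) if plan["relation"] else []
    if not landmarks:
        # No usable spatial relation: fall back to the most confident proposal.
        return max(targets, key=lambda c: c.score)
    # Spatial reasoning step: choose the target closest to any landmark.
    return min(
        targets,
        key=lambda t: min(math.dist(t.center, l.center) for l in landmarks),
    )


scene = [
    Candidate("chair", (0.0, 0.0, 0.0), 0.9),
    Candidate("chair", (3.0, 0.0, 0.0), 0.8),
    Candidate("table", (3.5, 0.0, 0.0), 0.7),
]
print(select("chair next to table", scene))
print(select("chair", scene))
```

In this toy scene the relational query selects the lower-scoring chair because it is nearer the table, whereas the bare query "chair" falls back to the highest-scoring proposal, which is the kind of difference the relation-evaluation stage is meant to capture.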