Can 3D Vision-Language Models Truly Understand Natural Language?

Abstract

Rapid advancements in 3D vision-language (3D-VL) tasks have opened up new avenues for human interaction with embodied agents or robots using natural language. Despite this progress, we find a notable limitation: existing 3D-VL models are sensitive to the style of language input and struggle to understand sentences that have the same semantic meaning but are written in different variants. This observation raises a critical question: Can 3D vision-language models truly understand natural language? To test the language understandability of 3D-VL models, we first propose a language robustness task for systematically assessing 3D-VL models across various tasks, benchmarking their performance when presented with different language style variants. Importantly, these variants are commonly encountered in applications requiring direct interaction with humans, such as embodied robotics, given the diversity and unpredictability of human language. We propose a 3D Language Robustness Dataset, designed based on the characteristics of human language, to facilitate the systematic study of robustness. Our comprehensive evaluation uncovers a significant drop in the performance of all existing models across various 3D-VL tasks. Even the state-of-the-art 3D-LLM fails to understand some variants of the same sentences. Further in-depth analysis suggests that the existing models have a fragile and biased fusion module, which stems from the low diversity of the existing dataset. Finally, we propose a training-free module driven by an LLM, which improves language robustness. Datasets and code will be available on GitHub.
