Uni3DL: Unified Model for 3D and Language Understanding

Abstract

In this work, we present Uni3DL, a unified model for 3D and Language understanding. Distinct from existing unified vision-language models in 3D which are limited in task variety and predominantly dependent on projected multi-view images, Uni3DL operates directly on point clouds. This approach significantly expands the range of supported tasks in 3D, encompassing both vision and vision-language tasks in 3D. At the core of Uni3DL, a query transformer is designed to learn task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router is employed to selectively generate task-specific outputs required for diverse tasks. With a unified architecture, our Uni3DL model enjoys seamless task decomposition and substantial parameter sharing across tasks. Uni3DL has been rigorously evaluated across diverse 3D vision-language understanding tasks, including semantic segmentation, object detection, instance segmentation, visual grounding, 3D captioning, and text-3D cross-modal retrieval. It demonstrates performance on par with or surpassing state-of-the-art (SOTA) task-specific models. We hope our benchmark and Uni3DL model will serve as a solid step to ease future research in unified models in the realm of 3D and language understanding. Project page: https://uni3dl.github.io.
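
As a rough illustration of the task-router idea described above, the following Python sketch shows how shared query features could be dispatched only to the output heads a given task needs. It is a toy under stated assumptions, not the Uni3DL implementation: the head names, the task-to-head mapping, and the functions `make_head` and `route` are all hypothetical, and the heads are placeholders rather than learned modules.

```python
from typing import Callable, Dict, List

QueryFeatures = List[List[float]]  # stand-in for per-query feature vectors


def make_head(name: str) -> Callable[[QueryFeatures], str]:
    """Build a placeholder head; a real head would be a learned projection."""
    def head(queries: QueryFeatures) -> str:
        return f"{name} output for {len(queries)} queries"
    return head


# Hypothetical head names (assumption, not taken from the paper).
HEADS: Dict[str, Callable[[QueryFeatures], str]] = {
    name: make_head(name)
    for name in ("class", "mask", "grounding", "caption", "retrieval")
}

# Illustrative mapping from task to required heads (assumption).
TASK_HEADS: Dict[str, List[str]] = {
    "semantic_segmentation": ["class", "mask"],
    "instance_segmentation": ["class", "mask"],
    "visual_grounding": ["mask", "grounding"],
    "captioning": ["caption"],
    "retrieval": ["retrieval"],
}


def route(task: str, queries: QueryFeatures) -> Dict[str, str]:
    """Run only the heads required by `task` on the shared query features."""
    if task not in TASK_HEADS:
        raise ValueError(f"unknown task: {task}")
    return {name: HEADS[name](queries) for name in TASK_HEADS[task]}


if __name__ == "__main__":
    shared_queries = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    print(route("visual_grounding", shared_queries))
```

In this toy setup, every task reuses the same query features and only the selected heads differ, which mirrors the parameter sharing the abstract attributes to the unified architecture.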
