A Review of 3D Object Detection with Vision-Language Models

  • 2025-04-26 00:27:26
  • Ranjan Sapkota, Konstantinos I Roumeliotis, Rahul Harsha Cheppally, Marco Flores Calero, Manoj Karkee

Abstract

This review presents a comprehensive survey of 3D object detection with vision-language models (VLMs), a rapidly advancing area at the intersection of 3D vision and multimodal AI. By examining over 100 research papers, we provide the first systematic analysis dedicated to this topic. We begin by outlining the unique challenges of 3D object detection with VLMs, emphasizing how it differs from 2D detection in spatial reasoning and data complexity. Traditional approaches using point clouds and voxel grids are compared with modern vision-language frameworks such as CLIP and 3D LLMs, which enable open-vocabulary detection and zero-shot generalization. We review key architectures, pretraining strategies, and prompt engineering methods that align textual and 3D features for effective detection. Visualization examples and evaluation benchmarks are discussed to illustrate performance and behavior. Finally, we highlight current challenges, such as limited 3D-language datasets and computational demands, and propose future research directions to advance the field.

Keywords: Object Detection, Vision-Language Models, Agents, VLMs, LLMs, AI

 
