Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention

  • 2024-10-29 18:52:20
  • Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh

Abstract

Multi-object 3D Grounding involves locating 3D boxes based on a given query phrase from a point cloud. It is a challenging and significant task with numerous applications in visual understanding, human-computer interaction, and robotics. To tackle this challenge, we introduce D-LISA, a two-stage approach incorporating three innovations. First, a dynamic vision module that enables a variable and learnable number of box proposals. Second, a dynamic camera positioning that extracts features for each proposal. Third, a language-informed spatial attention module that better reasons over the proposals to output the final prediction. Empirically, experiments show that our method outperforms the state-of-the-art methods on multi-object 3D grounding by 12.8% (absolute) and is competitive in single-object 3D grounding.

 
