WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar

Abstract

The perception of waterways based on human intent is significant for autonomous navigation and operations of Unmanned Surface Vehicles (USVs) in water environments. Inspired by visual grounding, we introduce WaterVG, the first visual grounding dataset designed for USV-based waterway perception based on human prompts. WaterVG encompasses prompts describing multiple targets, with instance-level annotations including bounding boxes and masks. Notably, WaterVG includes 11,568 samples with 34,987 referred targets, whose prompts integrate both visual and radar characteristics. Guiding two sensors with text allows finer-grained prompts that draw on both the visual and radar features of referred targets. Moreover, we propose a low-power visual grounding model, Potamoi, which is a multi-task model with a well-designed Phased Heterogeneous Modality Fusion (PHMF) mode, including Adaptive Radar Weighting (ARW) and Multi-Head Slim Cross Attention (MHSCA). Specifically, ARW extracts the required radar features to fuse with vision for prompt alignment. MHSCA is an efficient fusion module with a remarkably small parameter count and FLOPs, fusing the scenario context captured by the two sensors with linguistic features, and it performs effectively on visual grounding tasks. Comprehensive experiments and evaluations have been conducted on WaterVG, where Potamoi achieves state-of-the-art performance compared with counterparts.
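
To make the idea of fusing sensor context with linguistic features more concrete, the following is a minimal sketch of standard multi-head cross attention, in which fused vision-radar tokens attend to text tokens. It is not the MHSCA module itself: the abstract does not specify how the "slim" variant reduces parameters and FLOPs, so this shows only the generic mechanism. The function name, the token counts, the feature dimension, and the random weights are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(queries, context, w_q, w_k, w_v, w_o, num_heads):
    """Generic multi-head cross attention (hypothetical helper, not MHSCA).

    queries: (n_q, d_model) tokens that ask, e.g. fused vision-radar features
    context: (n_c, d_model) tokens that answer, e.g. text prompt features
    """
    n_q, d_model = queries.shape
    n_c = context.shape[0]
    d_head = d_model // num_heads

    # Project, then split the feature dimension into heads: (heads, n, d_head).
    q = (queries @ w_q).reshape(n_q, num_heads, d_head).transpose(1, 0, 2)
    k = (context @ w_k).reshape(n_c, num_heads, d_head).transpose(1, 0, 2)
    v = (context @ w_v).reshape(n_c, num_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product scores between every query and every context token.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n_q, n_c)
    weights = softmax(scores, axis=-1)

    # Weighted sum of values, then merge the heads back together.
    heads = weights @ v  # (heads, n_q, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n_q, d_model)
    return merged @ w_o, weights


# Illustrative shapes only: 6 sensor tokens, 4 text tokens, 8 features, 2 heads.
rng = np.random.default_rng(0)
d_model, num_heads = 8, 2
sensor_tokens = rng.normal(size=(6, d_model))
text_tokens = rng.normal(size=(4, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

fused, attn = multi_head_cross_attention(
    sensor_tokens, text_tokens, w_q, w_k, w_v, w_o, num_heads
)
```

Here `fused` has one row per sensor token, and each row of `attn` is a probability distribution over the text tokens for a given head.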
