UEMM-Air: Make Unmanned Aerial Vehicles Perform More Multi-modal Tasks

Abstract

The development of multi-modal learning for Unmanned Aerial Vehicles (UAVs) typically relies on a large amount of pixel-aligned multi-modal image data. However, existing datasets face challenges such as limited modalities, high construction costs, and imprecise annotations. To this end, we propose a synthetic multi-modal UAV-based multi-task dataset, UEMM-Air. Specifically, we simulate various UAV flight scenarios and object types using the Unreal Engine (UE). We then design the UAV's flight logic to automatically collect data from different scenarios, perspectives, and altitudes. Furthermore, we propose a novel heuristic automatic annotation algorithm to generate accurate object detection labels. Finally, we use the labels to generate text descriptions of the images so that UEMM-Air supports more cross-modality tasks. In total, UEMM-Air consists of 120k pairs of images with 6 modalities and precise annotations. Moreover, we conduct numerous experiments and establish new benchmark results on our dataset. We also find that models pre-trained on UEMM-Air exhibit better performance on downstream tasks than models pre-trained on other similar datasets. The dataset is publicly available (https://github.com/1e12Leon/UEMM-Air) to support research on multi-modal tasks for UAVs.
