Abstract
We introduce a new benchmark designed to advance the development of general-purpose, large-scale vision-language models for remote sensing images. Although several vision-language datasets in remote sensing have been proposed to pursue this goal, existing datasets are typically tailored to single tasks, lack detailed object information, or suffer from inadequate quality control. To address these limitations, we present a Versatile vision-language Benchmark for Remote Sensing image understanding, termed VRSBench. This benchmark comprises 29,614 images, with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs. It facilitates the training and evaluation of vision-language models across a broad spectrum of remote sensing image understanding tasks. We further evaluate state-of-the-art models on this benchmark for three vision-language tasks: image captioning, visual grounding, and visual question answering. Our work aims to contribute significantly to the development of advanced vision-language models in the field of remote sensing. The data and code can be accessed at https://github.com/lx709/VRSBench.