Abstract
This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis. VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD), and an honest instruction dataset comprising both factual and deceptive questions (HnstD). Unlike prevailing remote sensing image-text datasets, in which image captions focus on a few prominent objects and their relationships, VersaD captions provide detailed information about image properties, object attributes, and the overall scene. This comprehensive captioning enables VHM to thoroughly understand remote sensing images and perform diverse remote sensing tasks. Moreover, different from existing remote sensing instruction datasets that only include factual questions, HnstD contains additional deceptive questions stemming from the non-existence of objects. This feature prevents VHM from producing affirmative answers to nonsense queries, thereby ensuring its honesty. In our experiments, VHM significantly outperforms various vision language models on common tasks of scene classification, visual question answering, and visual grounding. Additionally, VHM achieves competent performance on several unexplored tasks, such as building vectorizing, multi-label classification and honest question answering. We will release the code, data and model weights at https://github.com/opendatalab/VHM .