Abstract
Segmentation models can recognize a pre-defined set of objects in images. However, models that can reason over complex user queries that implicitly refer to multiple objects of interest are still in their infancy. Recent advances in reasoning segmentation--generating segmentation masks from complex, implicit query text--demonstrate that vision-language models can operate across an open domain and produce reasonable outputs. However, our experiments show that such models struggle with complex remote-sensing imagery. In this work, we introduce LISAt, a vision-language model designed to describe complex remote-sensing scenes, answer questions about them, and segment objects of interest. We trained LISAt on a new curated geospatial reasoning-segmentation dataset, GRES, with 27,615 annotations over 9,205 images, and a multimodal pretraining dataset, PreGRES, containing over 1 million question-answer pairs. LISAt outperforms existing geospatial foundation models such as RS-GPT4V by over 10.04% (BLEU-4) on remote-sensing description tasks, and surpasses state-of-the-art open-domain models on reasoning segmentation tasks by 143.36% (gIoU). Our model, datasets, and code are available at https://lisat-bair.github.io/LISAt/