Abstract
Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an imperative factor of visual understanding and dialogue. However, the benefits of such representation in LMMs are limited to the natural image domain, and these models perform poorly for remote sensing (RS). The distinct overhead viewpoint, scale variation, and presence of small objects in high-resolution RS imagery present a unique challenge in region-level comprehension. Moreover, the development of the grounding conversation capability of LMMs within RS is hindered by the lack of granular, RS domain-specific grounded data. Addressing these limitations, we propose GeoPixel, the first end-to-end high-resolution RS-LMM that supports pixel-level grounding. This capability allows fine-grained visual perception by generating interleaved masks in conversation. GeoPixel supports up to 4K HD resolution in any aspect ratio, ideal for high-precision RS image analysis. To support grounded conversation generation (GCG) in RS imagery, we curate a visually grounded dataset, GeoPixelD, through a semi-automated pipeline that utilizes set-of-marks prompting and spatial priors tailored for RS data to methodically control the data generation process. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks. Our methodological ablation studies validate the effectiveness of each component in the overall architecture. Our code and data will be publicly released.