Abstract
Language-guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, necessitating models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposal, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches. We believe the contributions of the proposed tasks, benchmark, and effective approach will advance future research in developing versatile object recognition systems.
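
To make the mask-label assignment step concrete, the following is a minimal illustrative sketch, not the InstructSAM implementation. It assumes a precomputed similarity matrix between mask proposals and categories and a per-category object count, and it assumes the objective is to maximize total similarity while giving each mask at most one category and each category exactly its estimated count. The function name `assign_labels`, the example scores, and the exhaustive search are all hypothetical choices made for illustration.

```python
from itertools import product


def assign_labels(similarity, counts):
    """Exhaustively solve a tiny mask-label assignment problem.

    similarity[i][j]: assumed score between mask i and category j.
    counts[j]: assumed number of objects of category j.
    Returns one entry per mask (a category index, or None if the mask
    is left unassigned), or None if no assignment satisfies the counts.
    """
    n_masks = len(similarity)
    n_cats = len(counts)
    # Each mask picks exactly one option, so "at most one category
    # per mask" holds by construction.
    options = [None] + list(range(n_cats))
    best_score, best = float("-inf"), None
    for choice in product(options, repeat=n_masks):
        # Counting constraint: category j receives exactly counts[j] masks.
        if any(choice.count(j) != counts[j] for j in range(n_cats)):
            continue
        score = sum(
            similarity[i][j] for i, j in enumerate(choice) if j is not None
        )
        if score > best_score:
            best_score, best = score, list(choice)
    return best


# Hypothetical scores for four mask proposals and two categories.
similarity = [
    [0.90, 0.20],
    [0.80, 0.30],
    [0.10, 0.70],
    [0.40, 0.35],
]
counts = [2, 1]  # two objects of category 0, one of category 1
print(assign_labels(similarity, counts))  # -> [0, 0, 1, None]
```

In this sketch the fourth mask is discarded because the counts are already met by better-matching masks, so no confidence threshold is needed to reject it. The enumeration grows exponentially with the number of masks and is only meant to show the structure of the problem; a practical system would hand the same objective and constraints to a dedicated integer programming solver.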