Abstract
Accurate hyperspectral image (HSI) interpretation is critical for providing valuable insights into various earth observation-related applications such as urban planning, precision agriculture, and environmental monitoring. However, existing HSI processing methods are predominantly task-specific and scene-dependent, which severely limits their ability to transfer knowledge across tasks and scenes, thereby reducing the practicality in real-world applications. To address these challenges, we present HyperSIGMA, a vision transformer-based foundation model that unifies HSI interpretation across tasks and scenes, scalable to over one billion parameters. To overcome the spectral and spatial redundancy inherent in HSIs, we introduce a novel sparse sampling attention (SSA) mechanism, which effectively promotes the learning of diverse contextual features and serves as the basic block of HyperSIGMA. HyperSIGMA integrates spatial and spectral features using a specially designed spectral enhancement module. In addition, we construct a large-scale hyperspectral dataset, HyperGlobal-450K, for pre-training, which contains about 450K hyperspectral images, significantly surpassing existing datasets in scale. Extensive experiments on various high-level and low-level HSI tasks demonstrate HyperSIGMA's versatility and superior representational capability compared to current state-of-the-art methods. Moreover, HyperSIGMA shows significant advantages in scalability, robustness, cross-modal transferring capability, real-world applicability, and computational efficiency. The code and models will be released at https://github.com/WHU-Sigma/HyperSIGMA.