Abstract
The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has garnered significant attention due to its potential for energy-efficient, high-performance computing. However, a substantial performance gap still exists between SNN-based and ANN-based transformer architectures. While existing methods propose spiking self-attention mechanisms that combine successfully with SNNs, the overall architectures they propose are bottlenecked in extracting features effectively at different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture that is the first to use multi-scale spiking attention (MSSA) to enhance the capabilities of spiking attention blocks. We validate our approach on several mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, positioning it as a state-of-the-art solution among SNN-transformer architectures. The code is available at https://github.com/Nanhu-AI-Lab/MSViT.