Abstract
Both Convolutional Neural Networks (CNNs) and Transformers have shown great success in semantic segmentation tasks. Efforts have been made to integrate CNNs with Transformer models to capture both local and global context interactions. However, there is still room for improvement, particularly under constrained computational resources. In this paper, we introduce HAFormer, a model that combines the hierarchical feature extraction ability of CNNs with the global dependency modeling capability of Transformers to tackle lightweight semantic segmentation. Specifically, we design a Hierarchy-Aware Pixel-Excitation (HAPE) module for adaptive multi-scale local feature extraction. For global perception modeling, we devise an Efficient Transformer (ET) module that streamlines the quadratic computation associated with traditional Transformers. Moreover, a correlation-weighted Fusion (cwF) module selectively merges diverse feature representations, significantly enhancing predictive accuracy. HAFormer achieves high performance with minimal computational overhead and a compact model size, reaching 74.2% mIoU on the Cityscapes test set and 71.1% mIoU on the CamVid test set, with frame rates of 105 FPS and 118 FPS, respectively, on a single 2080Ti GPU. The source code is available at https://github.com/XU-GITHUB-curry/HAFormer.
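For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean Intersection-over-Union (mIoU) is commonly computed from a confusion matrix. It is not the evaluation code used for HAFormer; the function names `confusion_matrix` and `mean_iou` are hypothetical, and the ignore label of 255 is an assumption borrowed from common segmentation practice.

```python
from typing import List, Sequence


def confusion_matrix(pred: Sequence[int], target: Sequence[int],
                     num_classes: int, ignore_index: int = 255) -> List[List[int]]:
    """Count pixels per (ground-truth class, predicted class) pair.

    `ignore_index` (assumed to be 255 here) marks pixels excluded from scoring.
    """
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip unlabeled pixels
        cm[t][p] += 1
    return cm


def mean_iou(cm: List[List[int]]) -> float:
    """Average the per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # class c pixels predicted as another class
        fp = sum(cm[r][c] for r in range(n)) - tp   # other-class pixels predicted as class c
        denom = tp + fp + fn
        if denom == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy flattened label maps with three classes and one ignored pixel.
    target = [0, 0, 1, 1, 2, 255]
    pred = [0, 1, 1, 1, 2, 0]
    print(mean_iou(confusion_matrix(pred, target, num_classes=3)))
```

In the toy example the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18. On a real benchmark the confusion matrix would be accumulated over every image in the test set before the per-class ratios are taken.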