Abstract
Transformer-based segmentation methods face the challenge of efficient inference when dealing with high-resolution images. Recently, several linear attention architectures, such as Mamba and RWKV, have attracted much attention as they can process long sequences efficiently. In this work, we focus on designing an efficient segment-anything model by exploring these different architectures. Specifically, we design a mixed backbone that contains convolution and RWKV operations, which achieves the best results in both accuracy and efficiency. In addition, we design an efficient decoder that utilizes multiscale tokens to obtain high-quality masks. We denote our method as RWKV-SAM, a simple, effective, and fast baseline for SAM-like models. Moreover, we build a benchmark containing various high-quality segmentation datasets and jointly train one efficient yet high-quality segmentation model using this benchmark. Based on the benchmark results, our RWKV-SAM achieves outstanding performance in efficiency and segmentation quality compared to transformers and other linear attention models. For example, compared with the same-scale transformer model, RWKV-SAM achieves more than 2x speedup and better segmentation performance on various datasets. In addition, RWKV-SAM outperforms recent vision Mamba models with better classification and semantic segmentation results. Code and models will be publicly available.