Abstract
Advanced by the transformer architecture, vision foundation models (VFMs) have achieved remarkable progress in performance and generalization ability. The Segment Anything Model (SAM) is one remarkable model that can achieve generalized segmentation. However, most VFMs cannot run in real time, which makes it difficult to transfer them into products. On the other hand, current real-time segmentation methods mainly serve one purpose, such as semantic segmentation of driving scenes. We argue that diverse outputs are needed for real applications. Thus, this work explores a new real-time segmentation setting, named all-purpose segmentation in real-time, to transfer VFMs to real-time deployment. It contains three different tasks: interactive segmentation, panoptic segmentation, and video segmentation. We aim to use one model to achieve the above tasks in real time. We first benchmark several strong baselines. Then, we present Real-Time All Purpose SAM (RAP-SAM). It contains an efficient encoder and an efficient decoupled decoder to perform prompt-driven decoding. Moreover, we explore different training strategies and tuning methods to further boost co-training performance. Our code and model are available at https://github.com/xushilin1/RAP-SAM/.