SAM-REF: Introducing Image-Prompt Synergy during Interaction for Detail Enhancement in the Segment Anything Model

Abstract

Interactive segmentation aims to segment the mask of a target object according to the user's interactive prompts. There are two mainstream strategies: early fusion and late fusion. Current specialist models use the early fusion strategy, which encodes the combination of image and prompts to target the prompted object, yet the repeated complex computation on the image results in high latency. Late fusion models extract image embeddings once and merge them with the prompts in later interactions. This strategy avoids redundant image feature extraction and improves efficiency significantly. A recent milestone of this strategy is the Segment Anything Model (SAM). However, late fusion limits the model's ability to extract detailed information from the prompted target zone. To address this issue, we propose SAM-REF, a two-stage refinement framework that fully integrates images and prompts by introducing a lightweight refiner into the late-fusion interaction, combining the accuracy of early fusion with the efficiency of late fusion. Through extensive experiments, we show that our SAM-REF model outperforms the current state-of-the-art method on most segmentation-quality metrics without compromising efficiency.
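The efficiency difference between the two strategies can be illustrated with a toy sketch. It is not the SAM or SAM-REF implementation: every name below (`CountingEncoder`, `early_fusion_session`, `late_fusion_session`) is hypothetical, the "image" is a list of numbers, and the "encoder" is a placeholder that only counts how often it is invoked. It models the cost pattern only, not segmentation quality.

```python
# Toy illustration only: all names are hypothetical, not SAM or SAM-REF APIs.

class CountingEncoder:
    """Stands in for an expensive image backbone and counts its invocations."""

    def __init__(self):
        self.calls = 0

    def __call__(self, pixels):
        self.calls += 1
        # Placeholder "embedding": the mean pixel value.
        return sum(pixels) / len(pixels)


def early_fusion_session(image, prompts):
    """Early fusion: image and prompt are combined before the heavy encoder,
    so the encoder runs again for every user interaction."""
    encoder = CountingEncoder()
    outputs = []
    for prompt in prompts:
        fused_input = [pixel + prompt for pixel in image]
        outputs.append(encoder(fused_input))
    return outputs, encoder.calls


def late_fusion_session(image, prompts):
    """Late fusion: the image is encoded once, and each prompt is merged
    with the cached embedding by a cheap step."""
    encoder = CountingEncoder()
    embedding = encoder(image)  # heavy step, executed once
    outputs = [embedding + prompt for prompt in prompts]  # light step per prompt
    return outputs, encoder.calls


image = [0.0, 1.0, 2.0]
prompts = [1.0, 2.0, 3.0]  # three rounds of user interaction
_, early_calls = early_fusion_session(image, prompts)
_, late_calls = late_fusion_session(image, prompts)
print(early_calls, late_calls)
```

With three interactions, the early-fusion loop invokes the heavy encoder three times and the late-fusion loop invokes it once, which is the latency gap the abstract describes.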
