SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

Abstract

The Segment Anything Model 2 (SAM 2) has emerged as a powerful foundation model for object segmentation in both images and videos, paving the way for various downstream video applications. The crucial design of SAM 2 for video segmentation is its memory module, which prompts object-aware memories from previous frames for current frame prediction. However, its greedy-selection memory design suffers from the "error accumulation" problem, where an errored or missed mask will cascade and influence the segmentation of the subsequent frames, which limits the performance of SAM 2 toward complex long-term videos. To this end, we introduce SAM2Long, an improved training-free video object segmentation strategy, which considers the segmentation uncertainty within each frame and chooses the video-level optimal results from multiple segmentation pathways in a constrained tree search manner. In practice, we maintain a fixed number of segmentation pathways throughout the video. For each frame, multiple masks are proposed based on the existing pathways, creating various candidate branches. We then select the same fixed number of branches with higher cumulative scores as the new pathways for the next frame. After processing the final frame, the pathway with the highest cumulative score is chosen as the final segmentation result. Benefiting from its heuristic search design, SAM2Long is robust toward occlusions and object reappearances, and can effectively segment and track objects for complex long-term videos. Notably, SAM2Long achieves an average improvement of 3.0 points across all 24 head-to-head comparisons, with gains of up to 5.3 points in J&F on long-term video object segmentation benchmarks such as SA-V and LVOS. The code is released at https://github.com/Mark12Ding/SAM2Long.
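
The pathway selection described above can be illustrated with a minimal Python sketch. It is not the released implementation: the names `Pathway`, `propose`, `constrained_tree_search`, and `num_pathways` are hypothetical, masks are stand-in strings, and the per-frame scores are made-up toy values. The abstract does not specify how candidate masks are scored or how memory is updated, so the sketch assumes only that each candidate carries a score that is added to its pathway's cumulative score.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class Pathway:
    """One segmentation hypothesis: the masks chosen so far and their summed score."""
    masks: List[str] = field(default_factory=list)
    score: float = 0.0


# Hypothetical interface: given a frame and a pathway, return (mask, score) candidates.
ProposeFn = Callable[[int, Pathway], Sequence[Tuple[str, float]]]


def constrained_tree_search(frames: Sequence[int], propose: ProposeFn,
                            num_pathways: int = 3) -> Pathway:
    """Keep a fixed number of pathways per frame; return the best one at the end."""
    pathways = [Pathway()]
    for frame in frames:
        # Every existing pathway proposes several masks, creating candidate branches.
        branches = [
            Pathway(p.masks + [mask], p.score + s)
            for p in pathways
            for mask, s in propose(frame, p)
        ]
        if not branches:
            raise ValueError(f"no candidate masks proposed for frame {frame}")
        # Keep only the branches with the highest cumulative scores.
        branches.sort(key=lambda b: b.score, reverse=True)
        pathways = branches[:num_pathways]
    # After the final frame, choose the pathway with the highest cumulative score.
    return max(pathways, key=lambda b: b.score)


def toy_propose(frame: int, pathway: Pathway) -> Sequence[Tuple[str, float]]:
    """Toy scores (assumed): mask 'a' looks better at first but leads to a weak follow-up."""
    if not pathway.masks:
        return [("a", 0.75), ("b", 0.5)]
    if pathway.masks[-1] == "a":
        return [("a2", 0.25)]
    return [("b2", 1.0)]


# Keeping one pathway is greedy selection; keeping two recovers the better sequence.
greedy = constrained_tree_search([0, 1], toy_propose, num_pathways=1)
best = constrained_tree_search([0, 1], toy_propose, num_pathways=2)
print(greedy.masks, greedy.score)
print(best.masks, best.score)
```

With `num_pathways=1` the search commits to the locally best mask in the first frame and cannot recover, which mirrors the error accumulation problem attributed to greedy selection. With two pathways, the initially lower-scoring branch survives long enough to win on cumulative score.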
