LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation

Abstract

Video object segmentation (VOS) aims to distinguish and track target objects in a video. Despite the excellent performance achieved by off-the-shelf VOS models, existing VOS benchmarks mainly focus on short-term videos lasting about 5 seconds, where objects remain visible most of the time. However, these benchmarks poorly represent practical applications, and the absence of long-term datasets restricts further investigation of VOS in realistic scenarios. Thus, we propose a novel benchmark named LVOS, comprising 720 videos with 296,401 frames and 407,945 high-quality annotations. Videos in LVOS last 1.14 minutes on average, approximately 5 times longer than videos in existing datasets. Each video includes various attributes, especially challenges deriving from the wild, such as long-term reappearing and cross-temporal similar objects. Compared to previous benchmarks, our LVOS better reflects VOS models' performance in real scenarios. Based on LVOS, we evaluate 20 existing VOS models under 4 different settings and conduct a comprehensive analysis. On LVOS, these models suffer a large performance drop, highlighting the challenge of achieving precise tracking and segmentation in real-world scenarios. Attribute-based analysis indicates that the key factor in the accuracy decline is the increased video length, emphasizing LVOS's crucial role. We hope our LVOS can advance the development of VOS in real scenes. Data and code are available at https://lingyihongfd.github.io/lvos.github.io/.
