MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes

Abstract

Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths and limitations of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and over 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces significantly greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, and smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), and scenarios requiring external knowledge. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks. These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities. MOSEv2 is publicly available at https://MOSE.video.
