Abstract
Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable recent advances, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, which neglects the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, realizing robust segmentation of follow-up frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational effort based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. Notably, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state of the art in concept-aware video object segmentation.