Abstract
We propose a new task, 'open-world video instance segmentation and captioning'. It requires detecting, segmenting, tracking, and describing with rich captions never-before-seen objects. This challenging task can be addressed by developing "abstractors" that connect a vision model and a language foundation model. Concretely, we connect a multi-scale visual feature extractor and a large language model (LLM) by developing an object abstractor and an object-to-text abstractor. The object abstractor, consisting of a prompt encoder and transformer blocks, introduces spatially diverse open-world object queries to discover never-before-seen objects in videos. An inter-query contrastive loss further encourages the diversity of object queries. The object-to-text abstractor is augmented with masked cross-attention and acts as a bridge between the object queries and a frozen LLM to generate rich and descriptive object-centric captions for each detected object. Our generalized approach surpasses the baseline that jointly addresses the tasks of open-world video instance segmentation and dense video object captioning by 13% on never-before-seen objects and by 10% on object-centric captions.