Panoptic Scene Graph Generation

Abstract

Existing research addresses scene graph generation (SGG) -- a critical technology for scene understanding in images -- from a detection perspective, i.e., objects are detected using bounding boxes followed by prediction of their pairwise relationships. We argue that such a paradigm causes several problems that impede the progress of the field. For instance, bounding box-based labels in current datasets usually contain redundant classes like hairs, and leave out background information that is crucial to the understanding of context. In this work, we introduce panoptic scene graph generation (PSG), a new problem task that requires the model to generate a more comprehensive scene graph representation based on panoptic segmentations rather than rigid bounding boxes. A high-quality PSG dataset, which contains 49k well-annotated overlapping images from COCO and Visual Genome, is created for the community to keep track of its progress. For benchmarking, we build four two-stage baselines, which are modified from classic methods in SGG, and two one-stage baselines called PSGTR and PSGFormer, which are based on the efficient Transformer-based detector, i.e., DETR. While PSGTR uses a set of queries to directly learn triplets, PSGFormer separately models the objects and relations in the form of queries from two Transformer decoders, followed by a prompting-like relation-object matching mechanism. In the end, we share insights on open challenges and future directions.
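To make the target representation concrete, the following is a minimal sketch of what a panoptic scene graph could look like as a data structure: every pixel belongs to exactly one segment (objects and background alike), and relations are subject-predicate-object triplets over segment ids. The class names, field names, and the toy image are illustrative assumptions, not the PSG dataset format or the authors' code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Relation:
    """One subject-predicate-object triplet over segment ids (hypothetical)."""
    subject: int
    predicate: str
    obj: int


@dataclass
class PanopticSceneGraph:
    """Toy container: a panoptic segment map plus relations between segments."""
    seg_map: List[List[int]]      # H x W grid, each pixel holds one segment id
    labels: Dict[int, str]        # segment id -> class name (things and stuff)
    relations: List[Relation]

    def mask(self, seg_id: int) -> List[List[bool]]:
        # Binary mask of one segment, derived from the shared segment map.
        return [[pixel == seg_id for pixel in row] for row in self.seg_map]

    def area(self, seg_id: int) -> int:
        # Number of pixels assigned to the segment.
        return sum(pixel == seg_id for row in self.seg_map for pixel in row)

    def triplets(self) -> List[Tuple[str, str, str]]:
        # Human-readable (subject, predicate, object) class-name triplets.
        return [
            (self.labels[r.subject], r.predicate, self.labels[r.obj])
            for r in self.relations
        ]


# Toy 2x4 "image": a person and sky on top, grass (background) below.
graph = PanopticSceneGraph(
    seg_map=[[1, 1, 2, 2],
             [3, 3, 3, 3]],
    labels={1: "person", 2: "sky", 3: "grass"},
    relations=[Relation(1, "standing on", 3), Relation(2, "over", 3)],
)
```

Unlike a bounding-box graph, the segments here cannot overlap, and background regions such as "grass" can take part in relations like any other node.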
