Graphical Contrastive Losses for Scene Graph Generation

Abstract

Most scene graph generators use a two-stage pipeline to detect visual relationships: the first stage detects entities, and the second predicts the predicate for each entity pair using a softmax distribution. We find that such pipelines, trained with only a cross entropy loss over predicate classes, suffer from two common errors. The first, Entity Instance Confusion, occurs when the model confuses multiple instances of the same type of entity (e.g., multiple cups). The second, Proximal Relationship Ambiguity, arises when multiple subject-predicate-object triplets appear in close proximity with the same predicate, and the model struggles to infer the correct subject-object pairings (e.g., mis-pairing musicians and their instruments). We propose a set of contrastive loss formulations that specifically target these types of errors within the scene graph generation problem, collectively termed the Graphical Contrastive Losses. These losses explicitly force the model to disambiguate related and unrelated instances through margin constraints specific to each type of confusion. We further construct a relationship detector, called RelDN, using the aforementioned pipeline to demonstrate the efficacy of our proposed losses. Our model outperforms the winning method of the OpenImages Relationship Detection Challenge by 4.7% (16.5% relative) on the test set. We also show improved results over the best previous methods on the Visual Genome and Visual Relationship Detection datasets.
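
The abstract does not spell out the loss formulas, so the following is only a minimal sketch of what a margin constraint between related and unrelated instances could look like. It is a generic hinge-style contrastive loss, not the paper's exact formulation. The function name `margin_contrastive_loss`, the score layout, and the default margin of 0.2 are all assumptions made for illustration.

```python
from typing import Sequence


def margin_contrastive_loss(
    pos_scores: Sequence[Sequence[float]],
    neg_scores: Sequence[Sequence[float]],
    margin: float = 0.2,  # hypothetical value, not taken from the paper
) -> float:
    """Average hinge loss over anchor entities (illustrative sketch only).

    pos_scores[i] holds relatedness scores between anchor entity i and the
    entities it is truly related to; neg_scores[i] holds scores between
    anchor i and entities it is not related to. The loss is zero for an
    anchor once its weakest related score beats its strongest unrelated
    score by at least `margin`.
    """
    if len(pos_scores) != len(neg_scores):
        raise ValueError("pos_scores and neg_scores must have the same length")

    losses = []
    for pos, neg in zip(pos_scores, neg_scores):
        if not pos or not neg:
            # An anchor needs both related and unrelated candidates
            # for the contrast to be defined; skip it otherwise.
            continue
        gap = min(pos) - max(neg)
        losses.append(max(0.0, margin - gap))

    # No usable anchors means there is nothing to penalize.
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch, an anchor with several look-alike candidates (for example, one person and multiple cups) only contributes zero loss when the truly related cup scores clearly higher than every other cup, which is the kind of disambiguation the abstract describes.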
