Abstract
In this paper, we introduce a novel human interaction detection approach, based on CALIPSO (Classifying ALl Interacting Pairs in a Single shOt), a classifier of human-object interactions. This new single-shot interaction classifier estimates interactions simultaneously for all human-object pairs, regardless of their number and class. State-of-the-art approaches adopt a multi-shot strategy based on a pairwise estimate of interactions for a set of human-object candidate pairs, which leads to a complexity that depends, at best, on the number of interactions and, at worst, on the number of candidate pairs. In contrast, the proposed method estimates the interactions on the whole image: it simultaneously estimates all interactions between all human subjects and object targets in a single forward pass over the image. Consequently, its complexity and computation time are constant, independent of the number of subjects, objects or interactions in the image. In detail, interaction classification is performed on a dense grid of anchors by a joint multi-task network that learns three complementary tasks simultaneously: (i) prediction of the types of interaction, (ii) estimation of the presence of a target and (iii) learning of an embedding which maps an interacting subject and target to the same representation, using a metric learning strategy. In addition, we introduce an object-centric passive-voice verb estimation which significantly improves results. Evaluations on the two well-known Human-Object Interaction image datasets, V-COCO and HICO-DET, demonstrate the competitiveness of the proposed method (2nd place) compared to the state-of-the-art, while having constant computation time regardless of the number of objects and interactions in the image.
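To make task (iii) concrete, the following minimal sketch shows one way a per-anchor embedding could be used to associate subjects with targets: each subject anchor is matched to the object anchor whose embedding is closest. This is an illustration under stated assumptions, not the decoding procedure of the paper. The function name `pair_subjects_with_targets`, the Euclidean distance, the `max_distance` threshold and the hand-filled embedding grid are all hypothetical; in practice the embeddings would come from the network's forward pass.

```python
import numpy as np


def pair_subjects_with_targets(embeddings, subject_anchors, object_anchors,
                               max_distance=1.0):
    """Match each subject anchor to the object anchor with the closest embedding.

    embeddings: (H, W, D) array, one D-dimensional vector per grid anchor
        (assumed to be produced by the network; hand-filled in the example).
    subject_anchors, object_anchors: lists of (row, col) grid positions.
    max_distance: hypothetical threshold above which no target is assigned.
    Returns a list of (subject_index, object_index or None).
    """
    if not subject_anchors:
        return []
    if not object_anchors:
        return [(s, None) for s in range(len(subject_anchors))]

    sub = np.array([embeddings[r, c] for r, c in subject_anchors])  # (S, D)
    obj = np.array([embeddings[r, c] for r, c in object_anchors])   # (O, D)

    # Pairwise Euclidean distances between subject and object embeddings.
    dist = np.linalg.norm(sub[:, None, :] - obj[None, :, :], axis=-1)  # (S, O)
    nearest = dist.argmin(axis=1)

    pairs = []
    for s, o in enumerate(nearest):
        # Keep the match only if the two embeddings are close enough.
        pairs.append((s, int(o) if dist[s, o] <= max_distance else None))
    return pairs


# Toy 4x4 grid with 2-D embeddings (values chosen by hand for illustration).
embeddings = np.zeros((4, 4, 2))
embeddings[0, 0] = [1.0, 0.0]    # subject anchor
embeddings[1, 2] = [-1.0, 0.0]   # object anchor, far in embedding space
embeddings[3, 3] = [1.0, 0.1]    # object anchor, close in embedding space

pairs = pair_subjects_with_targets(embeddings, [(0, 0)], [(1, 2), (3, 3)])
print(pairs)
```

In this toy example the subject is paired with the second object, whose embedding lies close to its own, even though that object is farther away on the grid. The sketch covers only the matching step; interaction types (i) and target presence (ii) would be read from the other outputs of the network.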