Locality-Aware Zero-Shot Human-Object Interaction Detection

Abstract

Recent methods for zero-shot Human-Object Interaction (HOI) detection typically leverage the generalization ability of a large Vision-Language Model (VLM), i.e., CLIP, on unseen categories, showing impressive results in various zero-shot settings. However, existing methods struggle to adapt CLIP representations for human-object pairs, as CLIP tends to overlook the fine-grained information necessary for distinguishing interactions. To address this issue, we devise LAIN, a novel zero-shot HOI detection framework that enhances the locality and interaction awareness of CLIP representations. Locality awareness, which involves capturing fine-grained details and the spatial structure of individual objects, is achieved by aggregating the information and spatial priors of adjacent neighborhood patches. Interaction awareness, which involves identifying whether and how a human is interacting with an object, is achieved by capturing the interaction pattern between the human and the object. By infusing locality and interaction awareness into CLIP representations, LAIN captures detailed information about human-object pairs. Our extensive experiments on existing benchmarks show that LAIN outperforms previous methods in various zero-shot settings, demonstrating the importance of locality and interaction awareness for effective zero-shot HOI detection.
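
The idea of aggregating information from adjacent neighborhood patches can be illustrated with a minimal sketch. This is not LAIN's actual module, whose details the abstract does not give. It assumes patch features arranged on an H x W grid and uses a plain unweighted mean over a square window as a stand-in for the aggregation. The function name `aggregate_neighbors` and the `radius` parameter are hypothetical, and spatial priors are not modeled.

```python
from typing import List

# Patch features on an H x W grid, each patch a D-dimensional vector.
Grid = List[List[List[float]]]


def aggregate_neighbors(grid: Grid, radius: int = 1) -> Grid:
    """Replace each patch feature with the mean of its neighborhood.

    Assumption: a simple unweighted average over a (2*radius+1)^2 window,
    clipped at the image border. A learned aggregation would differ.
    """
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    out: Grid = []
    for i in range(h):
        row = []
        for j in range(w):
            acc = [0.0] * d
            count = 0
            # Visit every patch inside the window that lies on the grid.
            for ni in range(max(0, i - radius), min(h, i + radius + 1)):
                for nj in range(max(0, j - radius), min(w, j + radius + 1)):
                    for k in range(d):
                        acc[k] += grid[ni][nj][k]
                    count += 1
            row.append([v / count for v in acc])
        out.append(row)
    return out
```

With `radius=0` the function returns the input features unchanged. Larger radii mix in more surrounding context at the cost of blurring fine detail.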
