Abstract
Affordance grounding, the task of localizing object regions based on natural language descriptions of interactions, is a critical challenge for enabling intelligent agents to understand and interact with their environments. The task remains difficult because it requires fine-grained part-level localization, involves ambiguity arising from multiple valid interaction regions, and suffers from a scarcity of large-scale datasets. In this work, we introduce Affogato, a large-scale benchmark comprising 150K instances, annotated with open-vocabulary text descriptions and corresponding 3D affordance heatmaps across a diverse set of objects and interactions. Building on this benchmark, we develop simple yet effective vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models trained on the Affogato dataset achieve promising performance on existing 2D and 3D benchmarks and, notably, generalize effectively in open-vocabulary cross-domain settings. The Affogato dataset is publicly available at https://huggingface.co/datasets/project-affogato/affogato