From Lifestyle Vlogs to Everyday Interactions

Abstract

A major stumbling block to progress in understanding basic human interactions, such as getting out of bed or opening a refrigerator, is lack of good training data. Most past efforts have gathered this data explicitly: starting with a laundry list of action labels, and then querying search engines for videos tagged with each label. In this work, we do the reverse and search implicitly: we start with a large collection of interaction-rich video data and then annotate and analyze it. We use Internet Lifestyle Vlogs as the source of surprisingly large and diverse interaction data. We show that by collecting the data first, we are able to achieve greater scale and far greater diversity in terms of actions and actors. Additionally, our data exposes biases built into common explicitly gathered data. We make sense of our data by analyzing the central component of interaction -- hands. We benchmark two tasks: identifying semantic object contact at the video level and non-semantic contact state at the frame level. We additionally demonstrate future prediction of hands.
