Spatio-Temporal Event Segmentation and Localization for Wildlife Extended Videos

  • 2020-07-14 17:41:11
  • Ramy Mounir, Roman Gula, Jörn Theuerkauf, Sudeep Sarkar

Abstract

Using offline training schemes, researchers have tackled the event segmentation problem by providing full or weak supervision through manually annotated labels or self-supervised epoch-based training. Most works consider videos that are at most tens of minutes long. We present a self-supervised perceptual prediction framework capable of temporal event segmentation by building stable representations of objects over time and demonstrate it on long videos, spanning several days. The approach is deceptively simple but quite effective. We rely on predictions of high-level features computed by a standard deep learning backbone. For prediction, we use an LSTM, augmented with an attention mechanism, trained in a self-supervised manner using the prediction error. The self-learned attention maps effectively localize and track the event-related objects in each frame. The proposed approach does not require labels. It requires only a single pass through the video, with no separate training set. Given the lack of datasets of very long videos, we demonstrate our method on video from 10 days (254 hours) of continuous wildlife monitoring data that we collected with the required permissions. We find that the approach is robust to various environmental conditions such as day/night conditions, rain, sharp shadows, and windy conditions. For the task of temporally locating events, we obtained an 80% recall rate at a 20% false-positive rate for frame-level segmentation. At the activity level, we obtained an 80% activity recall rate at one false activity detection every 50 minutes. We will make the dataset, which is the first of its kind, and the code available to the research community.
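As a rough illustration of how a per-frame prediction error could drive temporal segmentation and how frame-level recall and false-positive rate are computed, consider the minimal sketch below. It is not the method from the paper: the abstract does not state the decision rule, so the running-baseline threshold, the function names, and the parameters `ratio` and `alpha` are all assumptions made for illustration. The errors are plain numbers standing in for the output of the LSTM predictor.

```python
def flag_frames(errors, ratio=2.0, alpha=0.5):
    """Flag a frame when its prediction error exceeds `ratio` times a
    running baseline. Hypothetical rule, not taken from the paper.
    The baseline is updated only on unflagged frames so that an ongoing
    event does not raise its own threshold."""
    flags = []
    baseline = errors[0] if errors else 0.0
    for err in errors:
        flagged = err > ratio * baseline
        flags.append(flagged)
        if not flagged:
            # Exponential moving average over background frames.
            baseline = (1 - alpha) * baseline + alpha * err
    return flags


def group_events(flags):
    """Merge runs of consecutive flagged frames into (start, end)
    intervals, with `end` inclusive."""
    events, start = [], None
    for t, flagged in enumerate(flags):
        if flagged and start is None:
            start = t
        elif not flagged and start is not None:
            events.append((start, t - 1))
            start = None
    if start is not None:
        events.append((start, len(flags) - 1))
    return events


def recall_and_fpr(flags, labels):
    """Frame-level recall = TP / (TP + FN); false-positive rate = FP / (FP + TN)."""
    tp = sum(1 for f, l in zip(flags, labels) if f and l)
    fn = sum(1 for f, l in zip(flags, labels) if not f and l)
    fp = sum(1 for f, l in zip(flags, labels) if f and not l)
    tn = sum(1 for f, l in zip(flags, labels) if not f and not l)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, fpr


# Toy data: made-up prediction errors and ground-truth event labels.
errors = [1.0, 1.0, 1.0, 4.0, 5.0, 4.0, 1.0, 1.0]
labels = [False, False, True, True, True, True, False, False]

flags = flag_frames(errors)
events = group_events(flags)
recall, fpr = recall_and_fpr(flags, labels)
```

On this toy input the sketch reports a single event covering frames 3 to 5, with a frame-level recall of 0.75 and a false-positive rate of 0.0. Sweeping `ratio` would trace out a recall versus false-positive-rate curve of the kind summarized by the operating points quoted in the abstract.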

 
