Spatiotemporal Relationship Reasoning for Pedestrian Intent Prediction

Abstract

Reasoning over visual data is a desirable capability for robotics and vision-based applications. Such reasoning enables forecasting of the next events or actions in videos. In recent years, various models based on convolution operations have been developed for prediction or forecasting, but they lack the ability to reason over spatiotemporal data and infer the relationships between different objects in the scene. In this paper, we present a framework based on graph convolution to uncover the spatiotemporal relationships in the scene for reasoning about pedestrian intent. A scene graph is built on top of segmented object instances within and across video frames. Pedestrian intent, defined as the future action of crossing or not crossing the street, is a crucial piece of information for autonomous vehicles to navigate safely and smoothly. We approach the problem of intent prediction from two different perspectives and anticipate the intention-to-cross in both pedestrian-centric and location-centric scenarios. In addition, we introduce a new dataset designed specifically for autonomous-driving scenarios in areas with dense pedestrian populations: the Stanford-TRI Intent Prediction (STIP) dataset. Our experiments on STIP and another benchmark dataset show that our graph modeling framework is able to predict the intention-to-cross of pedestrians with an accuracy of 79.10% on STIP and 79.28% on the Joint Attention for Autonomous Driving (JAAD) dataset up to one second before the actual crossing happens. These results outperform the baseline and previous work. Please refer to http://stip.stanford.edu/ for the dataset and code.
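
To make the idea of graph convolution over a scene graph concrete, the following is a minimal sketch of one generic mean-aggregation graph-convolution step, ReLU(D^-1 (A + I) H W), written in plain Python. It is not the model described in the paper: the function names (`matmul`, `graph_conv`), the three-node toy scene, the feature values, and the identity weight matrix are all assumptions chosen purely for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def graph_conv(adj: Matrix, feats: Matrix, weight: Matrix) -> Matrix:
    """One mean-aggregation graph convolution: ReLU(D^-1 (A + I) H W)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Row-normalise so each node averages over itself and its neighbours.
    norm = [[v / sum(row) for v in row] for row in a_hat]
    out = matmul(matmul(norm, feats), weight)
    # ReLU non-linearity.
    return [[max(0.0, v) for v in row] for row in out]


# Hypothetical toy scene graph (assumption, not from the paper):
# node 0 = pedestrian, node 1 = car, node 2 = crosswalk.
adj = [
    [0.0, 1.0, 1.0],  # pedestrian is linked to the car and the crosswalk
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
feats = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
weight = [
    [1.0, 0.0],
    [0.0, 1.0],
]  # identity weights, so the output shows the aggregation alone

updated = graph_conv(adj, feats, weight)
```

After this step, the pedestrian node's features are the average of its own features and those of the two objects it is connected to. In a real system the node features would come from segmented object instances and the weights would be learned, and edges would also connect instances across frames to capture temporal relationships.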
