Abstract
Predicting the motion of other agents in a scene is highly relevant for autonomous driving, as it allows a self-driving car to anticipate their behavior. Inspired by the success of decoder-only models for language modeling, we propose DONUT, a Decoder-Only Network for Unrolling Trajectories. Unlike existing encoder-decoder forecasting models, we encode historical trajectories and predict future trajectories with a single autoregressive model. This allows the model to make iterative predictions in a consistent manner and ensures that it is always provided with up-to-date information, thereby enhancing performance. Furthermore, inspired by multi-token prediction for language modeling, we introduce an 'overprediction' strategy that gives the model the auxiliary task of predicting trajectories at longer temporal horizons. This allows the model to better anticipate the future and further improves performance. Through experiments, we demonstrate that our decoder-only approach outperforms the encoder-decoder baseline and achieves new state-of-the-art results on the Argoverse 2 single-agent motion forecasting benchmark.
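As a rough illustration of the autoregressive unrolling described above, the following minimal sketch feeds each predicted step back into the same model that consumed the history, and also returns an auxiliary prediction one step further ahead. It is not the DONUT implementation: `toy_step` is a hypothetical constant-velocity stand-in for the learned decoder, and `unroll`, its parameters, and the choice to overpredict by exactly one extra step are assumptions made only for this example.

```python
def toy_step(tokens):
    """Hypothetical stand-in for a learned decoder (constant-velocity rule).

    tokens: list of (x, y) positions seen so far, history plus earlier predictions.
    Returns the next position and an 'overpredicted' position one step further.
    """
    (x0, y0), (x1, y1) = tokens[-2], tokens[-1]
    vx, vy = x1 - x0, y1 - y0
    next_pos = (x1 + vx, y1 + vy)          # main prediction: one step ahead
    over_pos = (x1 + 2 * vx, y1 + 2 * vy)  # auxiliary prediction: two steps ahead
    return next_pos, over_pos


def unroll(history, num_steps, step_fn=toy_step):
    """Unroll a trajectory with a single model shared by history and future."""
    tokens = list(history)
    overpredictions = []
    for _ in range(num_steps):
        next_pos, over_pos = step_fn(tokens)
        overpredictions.append(over_pos)  # assumed to act only as an auxiliary signal
        tokens.append(next_pos)           # the prediction becomes input for the next step
    return tokens[len(history):], overpredictions


history = [(0.0, 0.0), (1.0, 0.5)]
future, over = unroll(history, num_steps=3)
```

In this toy setting, the overprediction made at one step coincides with the main prediction made at the following step. With a learned model the two would generally differ, which is what would let the longer-horizon target act as an additional training signal.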