Abstract
Language models have demonstrated impressive ability in context understanding and generative performance. Inspired by the recent success of language foundation models, in this paper we propose LMTraj (Language-based Multimodal Trajectory predictor), which recasts the trajectory prediction task as a question-answering problem. Departing from traditional numerical regression models, which treat the trajectory coordinate sequence as continuous signals, we treat the coordinates as discrete signals, like text prompts. Specifically, we first transform the input space of the trajectory coordinates into the natural language space. Here, the entire time-series trajectories of pedestrians are converted into a text prompt, and scene images are described as text through image captioning. The transformed numerical and image data are then wrapped into a question-answering template for use in a language model. Next, to guide the language model in understanding and reasoning about high-level knowledge, such as scene context and social relationships between pedestrians, we introduce auxiliary multi-task question answering. We then train a numerical tokenizer on the prompt data. We encourage the tokenizer to separate the integer and decimal parts cleanly, and leverage it to capture correlations between consecutive numbers in the language model. Lastly, we train the language model using the numerical tokenizer and all of the question-answer prompts. Here, we propose a beam-search-based most-likely prediction and a temperature-based multimodal prediction to implement both deterministic and stochastic inference. Applying our LMTraj, we show that a language-based model can be a powerful pedestrian trajectory predictor, and that it outperforms existing numerical-based predictors. Code is publicly available at https://github.com/inhwanbae/LMTrajectory .
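To make the coordinate-to-text step concrete, the following is a minimal sketch of how an observed trajectory and a scene caption might be wrapped into a question-answering prompt, and how a generated answer could be parsed back into coordinates. The function names, the prompt wording, and the two-decimal formatting are assumptions chosen for illustration; they are not taken from the LMTraj code base.

```python
import re
from typing import List, Tuple

Point = Tuple[float, float]


def trajectory_to_text(points: List[Point], decimals: int = 2) -> str:
    """Render (x, y) coordinates as text with a fixed number of decimals."""
    return ", ".join(f"({x:.{decimals}f}, {y:.{decimals}f})" for x, y in points)


def build_prompt(ped_id: int, history: List[Point],
                 scene_caption: str, horizon: int) -> str:
    """Wrap the observed path and a scene caption in a QA-style template.

    The template wording is hypothetical, not the one used by the authors.
    """
    return (
        f"question: What trajectory does pedestrian {ped_id} "
        f"follow for the next {horizon} frames? "
        f"context: Pedestrian {ped_id} moved along "
        f"{trajectory_to_text(history)}. Scene: {scene_caption}"
    )


def text_to_trajectory(answer: str) -> List[Point]:
    """Parse '(x, y)' pairs from generated text back into floats."""
    pairs = re.findall(
        r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)", answer
    )
    return [(float(x), float(y)) for x, y in pairs]


history = [(1.0, 2.5), (1.2, 2.75)]
prompt = build_prompt(0, history, "a sidewalk next to a building", horizon=12)
```

Fixing the number of decimals keeps every coordinate in the same textual shape, which is one plausible way to give a tokenizer consistent integer and decimal parts to split on. The answer produced by the language model can then be decoded with `text_to_trajectory`, whether it comes from beam search or from temperature-based sampling.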