Doe-1: Closed-Loop Autonomous Driving with Large World Model

Abstract

End-to-end autonomous driving has received increasing attention due to its potential to learn from large amounts of data. However, most existing methods are still open-loop and suffer from weak scalability, lack of high-order interactions, and inefficient decision-making. In this paper, we explore a closed-loop framework for autonomous driving and propose a large Driving wOrld modEl (Doe-1) for unified perception, prediction, and planning. We formulate autonomous driving as a next-token generation problem and use multi-modal tokens to accomplish different tasks. Specifically, we use free-form texts (i.e., scene descriptions) for perception and generate future predictions directly in the RGB space with image tokens. For planning, we employ a position-aware tokenizer to effectively encode action into discrete tokens. We train a multi-modal transformer to autoregressively generate perception, prediction, and planning tokens in an end-to-end and unified manner. Experiments on the widely used nuScenes dataset demonstrate the effectiveness of Doe-1 in various tasks including visual question-answering, action-conditioned video generation, and motion planning. Code: https://github.com/wzzheng/Doe.
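
The abstract does not describe how the position-aware tokenizer is built. As a rough illustration of the general idea of encoding a continuous action into discrete tokens, the following is a minimal sketch that assumes uniform binning of a 2D displacement. The names `NUM_BINS`, `MAX_DISP`, `encode_action`, and `decode_action` are hypothetical and chosen for this example only; they are not taken from the paper or its code.

```python
from typing import Tuple

# Hypothetical settings for illustration only; not taken from the paper.
NUM_BINS = 128    # number of bins per axis
MAX_DISP = 10.0   # metres; displacements are clipped to [-MAX_DISP, MAX_DISP]
BIN_WIDTH = 2 * MAX_DISP / NUM_BINS


def encode_axis(value: float) -> int:
    """Map one displacement (metres) to a bin index in [0, NUM_BINS - 1]."""
    clipped = max(-MAX_DISP, min(MAX_DISP, value))
    frac = (clipped + MAX_DISP) / (2 * MAX_DISP)  # normalised to [0, 1]
    # The upper edge (frac == 1.0) is folded into the last bin.
    return min(NUM_BINS - 1, int(frac * NUM_BINS))


def decode_axis(index: int) -> float:
    """Return the centre of bin `index` in metres."""
    return -MAX_DISP + (index + 0.5) * BIN_WIDTH


def encode_action(dx: float, dy: float) -> Tuple[int, int]:
    """Encode a 2D displacement as one discrete token per axis."""
    return encode_axis(dx), encode_axis(dy)


def decode_action(tokens: Tuple[int, int]) -> Tuple[float, float]:
    """Recover an approximate displacement from its tokens."""
    return decode_axis(tokens[0]), decode_axis(tokens[1])
```

Under these assumptions, the round-trip error per axis is at most half a bin width for displacements inside the clipping range. Discrete action tokens of this kind can share a vocabulary with text and image tokens, which is what allows a single autoregressive model to emit all three.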
