Abstract
Recent successes in autoregressive (AR) generation models, such as the GPT series in natural language processing, have motivated efforts to replicate this success in visual tasks. Some works attempt to extend this approach to autonomous driving by building video-based world models capable of generating realistic future video sequences and predicting ego states. However, prior works tend to produce unsatisfactory results, as the classic GPT framework is designed to handle 1D contextual information, such as text, and lacks the inherent ability to model the spatial and temporal dynamics essential for video generation. In this paper, we present DrivingWorld, a GPT-style world model for autonomous driving, featuring several spatial-temporal fusion mechanisms. This design enables effective modeling of both spatial and temporal dynamics, facilitating high-fidelity, long-duration video generation. Specifically, we propose a next-state prediction strategy to model temporal coherence between consecutive frames and apply a next-token prediction strategy to capture spatial information within each frame. To further enhance generalization ability, we propose a novel masking strategy and reweighting strategy for token prediction to mitigate long-term drifting issues and enable precise control. Our work demonstrates the ability to produce high-fidelity and consistent video clips of over 40 seconds in duration, which is over 2 times longer than state-of-the-art driving world models. Experiments show that, in contrast to prior works, our method achieves superior visual quality and significantly more accurate controllable future video generation. Our code is available at https://github.com/YvanYin/DrivingWorld.