Abstract
Existing world models for autonomous driving struggle with long-horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices, and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state-of-the-art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We test whether discrete token models have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side-by-side comparison. Our study concludes in favor of the continuous autoregressive model, which is less brittle with respect to individual design choices and more powerful than the model built on discrete tokens. Code, models and qualitative results are publicly available at https://lmb-freiburg.github.io/orbis.github.io/.