Abstract
Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. In this paper, we present a unified Driving World Model named HERMES. We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios. Specifically, HERMES leverages a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information while preserving geometric relationships and interactions. We also introduce world queries, which incorporate world knowledge into BEV features via causal attention in the Large Language Model, enabling contextual enrichment for understanding and generation tasks. We conduct comprehensive studies on nuScenes and OmniDrive-nuScenes datasets to validate the effectiveness of our method. HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%. The model and code will be publicly released at https://github.com/LMD0311/HERMES.