OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving

Abstract

The rise of multi-modal large language models (MLLMs) has spurred their application in autonomous driving. Recent MLLM-based methods generate actions by learning a direct mapping from perception to action, neglecting the dynamics of the world and the relations between action and world dynamics. In contrast, human beings possess a world model that enables them to simulate future states based on an internal 3D visual representation and plan actions accordingly. To this end, we propose OccLLaMA, an occupancy-language-action generative world model, which uses semantic occupancy as a general visual representation and unifies vision-language-action (VLA) modalities through an autoregressive model. Specifically, we introduce a novel VQVAE-like scene tokenizer to efficiently discretize and reconstruct semantic occupancy scenes, accounting for their sparsity and class imbalance. Then, we build a unified multi-modal vocabulary for vision, language, and action. Furthermore, we enhance an LLM, specifically LLaMA, to perform next token/scene prediction on the unified vocabulary to complete multiple tasks in autonomous driving. Extensive experiments demonstrate that OccLLaMA achieves competitive performance across multiple tasks, including 4D occupancy forecasting, motion planning, and visual question answering, showcasing its potential as a foundation model in autonomous driving.
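To make the idea of a unified multi-modal vocabulary concrete, the following is a minimal sketch of one common way to place several discrete token sets in a single ID space, namely by giving each modality a contiguous range. It is an illustration under stated assumptions, not the paper's implementation: the class name `UnifiedVocab`, its methods, the range layout, and all sizes used here are hypothetical, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class UnifiedVocab:
    """Hypothetical layout: [text ids | scene code ids | action bin ids]."""

    n_text: int    # size of the language vocabulary (assumed value)
    n_scene: int   # number of scene-tokenizer codebook entries (assumed value)
    n_action: int  # number of discretized action bins (assumed value)

    @property
    def size(self) -> int:
        # Total number of token ids the autoregressive model would predict over.
        return self.n_text + self.n_scene + self.n_action

    def scene_id(self, code: int) -> int:
        # Map a scene codebook index to its id in the unified space.
        if not 0 <= code < self.n_scene:
            raise ValueError(f"scene code out of range: {code}")
        return self.n_text + code

    def action_id(self, bin_index: int) -> int:
        # Map an action bin index to its id in the unified space.
        if not 0 <= bin_index < self.n_action:
            raise ValueError(f"action bin out of range: {bin_index}")
        return self.n_text + self.n_scene + bin_index

    def modality(self, token_id: int) -> Tuple[str, int]:
        # Recover the modality and the modality-local index from a unified id.
        if not 0 <= token_id < self.size:
            raise ValueError(f"token id out of range: {token_id}")
        if token_id < self.n_text:
            return ("text", token_id)
        if token_id < self.n_text + self.n_scene:
            return ("scene", token_id - self.n_text)
        return ("action", token_id - self.n_text - self.n_scene)


if __name__ == "__main__":
    # Sizes below are placeholders chosen for the example only.
    vocab = UnifiedVocab(n_text=32000, n_scene=2048, n_action=256)
    print(vocab.size)
    print(vocab.scene_id(0), vocab.action_id(5))
    print(vocab.modality(vocab.action_id(5)))
```

With a layout like this, a single next-token prediction head can cover all three modalities, and the decoded id alone tells the caller whether to hand the token to a text detokenizer, a scene decoder, or an action de-quantizer.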
