Distilling Multi-modal Large Language Models for Autonomous Driving

Abstract

Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in long-tail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.
