Abstract
End-to-end architectures in autonomous driving (AD) face a significant challenge in interpretability, impeding human-AI trust. Human-friendly natural language has been explored for tasks such as driving explanation and 3D captioning. However, previous works have primarily focused on the paradigm of declarative interpretability, where the natural language interpretations are not grounded in the intermediate outputs of AD systems, making the interpretations only declarative. In contrast, aligned interpretability establishes a connection between language and the intermediate outputs of AD systems. Here we introduce Hint-AD, an integrated AD-language system that generates language aligned with the holistic perception-prediction-planning outputs of the AD model. By incorporating the intermediate outputs and a holistic token mixer sub-network for effective feature adaptation, Hint-AD achieves state-of-the-art results in driving language tasks including driving explanation, 3D dense captioning, and command prediction. To facilitate further study of the driving explanation task on nuScenes, we also introduce a human-labeled dataset, Nu-X. Code, dataset, and models will be publicly available.