Hint-AD: Holistically Aligned Interpretability for End-to-End Autonomous Driving

CoRL 2024

Kairui Ding 1,3, Boyuan Chen 1,3, Yuchen Su 3, Huan-ang Gao 1, Bu Jin 1, Chonghao Sima 4, Xiaohui Li 2, Wuqiang Zhang 2, Paul Barsch 2, Hongyang Li 4, Hao Zhao †1
1 Institute for AI Industry Research (AIR), Tsinghua University
2 Mercedes-Benz Group China Ltd.
3 Xingjian College, Tsinghua University
4 OpenDriveLab, Shanghai AI Lab
† Indicates corresponding author

Demonstration Video of Hint-AD.

Abstract

End-to-end architectures in autonomous driving (AD) face a significant interpretability challenge, which impedes human-AI trust. Human-friendly natural language has been explored for tasks such as driving explanation and 3D captioning. However, previous works primarily adopted the paradigm of declarative interpretability, where the natural language is not grounded in the intermediate outputs of the AD system, leaving the interpretations merely declarative. In contrast, aligned interpretability establishes a connection between language and the intermediate outputs of the AD system. Here we introduce Hint-AD, an integrated AD-language system that generates language aligned with the holistic perception-prediction-planning outputs of the AD model. By incorporating the intermediate outputs and a holistic token mixer sub-network for effective feature adaptation, Hint-AD attains state-of-the-art accuracy on driving-language tasks including driving explanation, 3D dense captioning, and command prediction. To facilitate further study of the driving explanation task on nuScenes, we also introduce a human-labeled dataset, Nu-X. Code, dataset, and models will be publicly available.

Introduction & Method

Illustration of two paradigms for interpretability of end-to-end autonomous driving (AD) systems through natural language. (a) Declarative interpretability does not utilize intermediate outputs from the AD system, producing text that merely justifies the car's driving behavior; (b) aligned interpretability incorporates intermediate outputs from the AD model to align the generated language with the holistic perception-prediction-planning process.


Framework of Hint-AD. (a) Overview of the Hint-AD pipeline: taking intermediate output tokens from an AD pipeline as input, a language decoder generates natural language responses; a holistic token mixer module adapts the tokens. (b) Detailed illustration of the BEV block architecture. (c) Detailed illustration of the instance block architecture.
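To make the data flow in (a) concrete, below is a minimal PyTorch sketch: AD tokens are adapted by a token mixer and then fed to a language decoder as a prefix. The module structure, dimensions, and two-layer MLP projections are our own illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the Hint-AD data flow (illustrative assumptions, not the
# authors' implementation): adapt AD tokens, then prepend them to the LM input.
import torch
import torch.nn as nn

class HolisticTokenMixer(nn.Module):
    """Projects heterogeneous AD tokens (BEV grid, track/motion/plan
    instances) into the language model's embedding space."""
    def __init__(self, bev_dim=256, inst_dim=256, lm_dim=768):  # assumed dims
        super().__init__()
        self.bev_proj = nn.Sequential(
            nn.Linear(bev_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))
        self.inst_proj = nn.Sequential(
            nn.Linear(inst_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))

    def forward(self, bev_tokens, inst_tokens):
        # bev_tokens: (B, N_bev, bev_dim); inst_tokens: (B, N_inst, inst_dim)
        mixed = torch.cat(
            [self.bev_proj(bev_tokens), self.inst_proj(inst_tokens)], dim=1)
        return mixed  # (B, N_bev + N_inst, lm_dim)

mixer = HolisticTokenMixer()
bev = torch.randn(1, 200, 256)   # flattened BEV tokens (assumed count)
inst = torch.randn(1, 32, 256)   # instance tokens (assumed count)
prefix = mixer(bev, inst)        # (1, 232, 768); prepended to the language
                                 # decoder's text embeddings before generation
```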

Dataset

Illustration of the Nu-X dataset.

Explanations serve as a guide for human learning and understanding. In the context of end-to-end autonomous driving (AD) systems in particular, human users often seek explanations that bridge the gap between sensor inputs and AD behaviors. Currently, no dataset provides such explanations for nuScenes, a widely used dataset in AD research. To address this gap and facilitate interpretability-focused research on nuScenes, we introduce Nu-X, a comprehensive, large-scale, human-labeled explanation dataset. Nu-X offers detailed contextual information and diverse linguistic expressions for each of the 34,000 key frames in nuScenes.
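Since Nu-X annotates nuScenes key frames, a natural way to consume it is as a mapping from nuScenes sample tokens to explanation text. The sketch below assumes a hypothetical JSON release with `sample_token` and `explanation` fields; the actual schema of the published dataset may differ.

```python
# Hypothetical loader for Nu-X annotations; field names are assumptions.
import json

def load_nux(path="nux_annotations.json"):
    """Map each nuScenes sample token to its human-written explanation."""
    with open(path) as f:
        records = json.load(f)
    return {r["sample_token"]: r["explanation"] for r in records}

# Usage: pair explanations with frames via the nuScenes devkit's sample tokens.
# nux = load_nux()
# text = nux[sample["token"]]
```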

Results

Qualitative Results. We present examples of the language output generated by Hint-AD across multiple tasks, including driving explanation, 3D dense captioning, VQA, command prediction, and four categories of alignment tasks. Captions that do not match the ground truth are colored in red.
Comparison with baselines. "Inter. outputs" denotes intermediate outputs; C/B/M/R/G denote CIDEr, BLEU-4, METEOR, ROUGE-L, and GPT score; H0/H1/All are NuScenes-QA accuracies on zero-hop questions, one-hop questions, and all questions. All methods are adapted to the BEV visual representation and trained on the mixed dataset. Hint-UniAD and Hint-VAD, two implementations of Hint-AD on different AD models, outperform the baselines across all four language tasks in the AD context.
Input                     | Method             | Nu-X                          | TOD3Cap                  | NuScenes-QA        | Command
                          |                    | C     B     M     R     G     | C      B     M     R     | H0    H1    All    | Acc.
--------------------------+--------------------+-------------------------------+--------------------------+--------------------+---------
Image + 6-shot examples   | GPT-4o             | 19.0  3.95  10.3  24.9  5.22  | 160.8  50.4  31.6  43.5  | 42.0  34.7  37.1   | 75.4
                          | Gemini 1.5         | 17.6  3.43   9.3  23.4  5.03  | 169.7  53.6  33.4  45.9  | 40.5  32.9  35.4   | 80.9
BEV(2D)                   | ADAPT              | 17.7  2.06  12.8  27.9  5.79  | -      -     -     -     | 51.0  44.2  46.4   | 79.3
                          | BEV+Adapter        | 18.6  3.47  11.3  24.5  6.27  | -      -     -     -     | 51.8  45.6  47.7   | 81.1
BEV(2D) + Bounding Boxes  | BEVDet+MCAN        | 13.2  2.91  10.3  24.5  5.04  | 104.9  50.1  43.0  68.0  | 56.2  46.7  49.9   | 80.7
                          | Vote2Cap-DETR      | 15.3  2.61  10.9  24.2  5.33  | 110.1  48.0  44.4  67.8  | 51.2  44.9  47.0   | 76.5
                          | TOD3Cap            | 14.5  2.45  10.5  23.0  5.10  | 120.3  51.5  45.1  70.1  | 53.0  45.1  49.0   | 78.2
BEV(2D) + Inter. outputs  | Hint-UniAD (Ours)  | 21.7  4.20  12.7  27.0  7.20  | 342.6  71.9  48.0  85.4  | 56.2  47.5  50.4   | 83.0
                          | Hint-VAD (Ours)    | 22.4  4.18  13.2  27.6  7.44  | 263.7  67.6  47.5  79.4  | 55.4  48.0  50.5   | 82.3
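For readers reproducing the captioning numbers, the standard COCO caption toolkit is a common choice for CIDEr and BLEU. A small sketch follows; we assume the pycocoevalcap package and pre-tokenized, lower-cased captions, since the paper's exact evaluation code is not shown here.

```python
# Sketch of caption scoring with pycocoevalcap (assumed tooling, not
# necessarily the paper's evaluation code). gts/res map sample ids to
# lists of reference / generated captions.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

gts = {"0": ["the car slows down because a pedestrian is crossing"],
       "1": ["the car turns left at the intersection"]}
res = {"0": ["the car brakes for a crossing pedestrian"],
       "1": ["the car is turning left"]}

bleu, _ = Bleu(4).compute_score(gts, res)    # list of BLEU-1..BLEU-4
cider, _ = Cider().compute_score(gts, res)   # corpus-level CIDEr
print(f"BLEU-4: {bleu[3]:.3f}  CIDEr: {cider:.3f}")
```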

BibTeX

If you find our work useful in your research, please consider citing:
@inproceedings{dinghint,
      title={Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving},
      author={Ding, Kairui and Chen, Boyuan and Su, Yuchen and Gao, Huan-ang and Jin, Bu and Sima, Chonghao and Li, Xiaohui and Zhang, Wuqiang and Barsch, Paul and Li, Hongyang and others},
      booktitle={8th Annual Conference on Robot Learning},
      year={2024}
}