Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving

Abstract

Vision-language models (VLMs) have become a promising approach to enhancing perception and decision-making in autonomous driving. A gap remains, however, in applying VLMs to understand complex scenarios involving pedestrian interactions and in deploying them efficiently on vehicles. In this paper, we propose a knowledge distillation method that transfers knowledge from large-scale vision-language foundation models to efficient vision networks, and we apply it to pedestrian behavior prediction and scene understanding tasks, achieving promising results in generating more diverse and comprehensive semantic attributes. We also utilize multiple pre-trained models and ensemble techniques to boost the model's performance. We further examine the effectiveness of the model after knowledge distillation; the results show significant metric improvements in open-vocabulary perception and trajectory prediction tasks, which can potentially enhance the end-to-end performance of autonomous driving.
