Abstract
Autonomous driving systems face significant challenges in handling unpredictable edge-case scenarios, such as adversarial pedestrian movements, dangerous vehicle maneuvers, and sudden environmental changes. Current end-to-end driving models struggle with generalization to these rare events due to limitations in traditional detection and prediction approaches. To address this, we propose INSIGHT (Integration of Semantic and Visual Inputs for Generalized Hazard Tracking), a hierarchical vision-language model (VLM) framework designed to enhance hazard detection and edge-case evaluation. By using multimodal data fusion, our approach integrates semantic and visual representations, enabling precise interpretation of driving scenarios and accurate forecasting of potential dangers. Through supervised fine-tuning of VLMs, we optimize spatial hazard localization using attention-based mechanisms and coordinate regression techniques. Experimental results on the BDD100K dataset demonstrate a substantial improvement in hazard prediction straightforwardness and accuracy over existing models, achieving a notable increase in generalization performance. This advancement enhances the robustness and safety of autonomous driving systems, ensuring improved situational awareness and potential decision-making in complex real-world scenarios.