Abstract
For safe and robust autonomous driving, decision-making systems must leverage past experience to handle the long tail of traffic scenarios. Case-Based Reasoning (CBR) offers a natural paradigm for this by adapting solutions from prior cases. However, in complex and dynamic traffic environments, traditional CBR methods struggle to abstract and adapt knowledge under uncertainty. Meanwhile, although multimodal large language models (MLLMs) exhibit strong perceptual and linguistic capabilities, their reasoning often relies on empirical pattern fitting, which limits robustness under distribution shift and in long-tail scenarios. We propose Traffic-MLLM, a retrieval-free neural case modeling framework for multimodal traffic reasoning. Instead of performing explicit case retrieval at inference time, Traffic-MLLM learns a structured and generalizable case space directly during training. To support this learning process, we construct a multi-source case base that integrates dynamic traffic videos with large-scale static visual question-answering data and serves as a unified training substrate for learning structured case representations. To further improve representation quality near knowledge boundaries, we introduce a curiosity-driven refinement mechanism based on Random Network Distillation (RND), which encourages the model to internalize cross-case structural regularities rather than surface correlations. Experiments on the SUTD-TrafficQA and DriveQA benchmarks demonstrate consistent improvements in dynamic reasoning, regulatory understanding, and cross-domain transfer. Traffic-MLLM achieves 50.8% accuracy on SUTD-TrafficQA, 74.8% on the CARLA-based DriveQA split, and 83.1% on the real-world Mapillary split. These results indicate that representation-level case-space refinement provides an effective alternative to explicit retrieval for scalable multimodal case adaptation.
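To make the curiosity signal concrete, the following is a minimal sketch of the generic RND idea, not the Traffic-MLLM implementation. A frozen, randomly initialized target network embeds each case, a trainable predictor learns to match that embedding, and the remaining prediction error serves as a novelty score. All names (`make_target`, `novelty`, `train_step`, `demo`), the toy 8-dimensional feature vectors, the linear predictor, and the hyperparameters are assumptions made for illustration.

```python
import math
import random


def make_target(n_in, n_out, rng):
    """Weights of the frozen random target network (hypothetical toy size)."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]


def target_embed(target, x):
    """Target embedding: one tanh layer whose weights are never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in target]


def predict(predictor, x):
    """Trainable predictor; a plain linear map is assumed here for brevity."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in predictor]


def novelty(target, predictor, x):
    """RND novelty score: squared error between predictor and frozen target."""
    t = target_embed(target, x)
    y = predict(predictor, x)
    return sum((yi - ti) ** 2 for yi, ti in zip(y, t))


def train_step(target, predictor, x, lr=0.05):
    """One SGD step on the predictor only; the target stays untouched."""
    t = target_embed(target, x)
    y = predict(predictor, x)
    for j, row in enumerate(predictor):
        grad_scale = 2.0 * (y[j] - t[j])  # d(error^2)/d(y_j)
        for i, xi in enumerate(x):
            row[i] -= lr * grad_scale * xi


def demo(seed=0, n_in=8, n_out=4, n_seen=4, epochs=200):
    """Train on a few 'seen' cases, then score a seen and an unseen case."""
    rng = random.Random(seed)
    target = make_target(n_in, n_out, rng)
    predictor = [[0.0] * n_in for _ in range(n_out)]
    seen = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_seen)]
    unseen = [rng.uniform(-1.0, 1.0) for _ in range(n_in)]
    before = novelty(target, predictor, seen[0])
    for _ in range(epochs):
        for x in seen:
            train_step(target, predictor, x)
    after = novelty(target, predictor, seen[0])
    return before, after, novelty(target, predictor, unseen)


if __name__ == "__main__":
    before, after, unseen_score = demo()
    print("seen case, before training:", before)
    print("seen case, after training: ", after)
    print("unseen case:               ", unseen_score)
```

In this toy setting, the score of a repeatedly seen case shrinks as the predictor fits the target, while cases far from the training data tend to keep a larger error. A refinement scheme in this spirit would direct extra training toward high-novelty cases; how Traffic-MLLM applies the score to its case representations is not specified here.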