Abstract
Video Anomaly Detection (VAD) aims to identify and locate deviations from normal patterns in video sequences. Traditional methods often struggle with substantial computational demands and a reliance on extensive labeled datasets, which restricts their practical applicability. To address these constraints, we propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning. We find that the intermediate hidden states of MLLMs contain information-rich representations that are more sensitive to anomalies, and more linearly separable, than those of the output layer. To capitalize on this, we propose a Dynamic Layer Saliency Probing (DLSP) mechanism that identifies and extracts the most informative hidden states from the optimal intermediate layer during MLLM reasoning. A lightweight anomaly scorer and a temporal localization module then efficiently detect anomalies using these extracted hidden states and finally generate explanations. Experiments on the UCF-Crime and XD-Violence datasets demonstrate that HiProbe-VAD outperforms existing training-free approaches and most traditional approaches. Furthermore, our framework exhibits remarkable cross-model generalization across different MLLMs without any tuning, unlocking the potential of pre-trained MLLMs for video anomaly detection and paving the way for more practical and scalable solutions.
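The idea of probing layers for the most separable hidden states can be illustrated with a minimal sketch. The abstract does not state which saliency criterion DLSP uses, so the code below substitutes a Fisher-style ratio (between-class distance over within-class spread) as a stand-in. The helper names `fisher_score`, `select_layer`, and `make_synthetic_states` are hypothetical, and the hidden states are synthetic Gaussian data rather than real MLLM activations.

```python
import numpy as np

def fisher_score(feats, labels):
    """Fisher-style separability for one layer (an assumed stand-in criterion)."""
    normal = feats[labels == 0]
    anomalous = feats[labels == 1]
    # Squared distance between the two class means.
    mean_gap = np.sum((anomalous.mean(axis=0) - normal.mean(axis=0)) ** 2)
    # Total within-class variance, summed over feature dimensions.
    spread = normal.var(axis=0).sum() + anomalous.var(axis=0).sum()
    return mean_gap / (spread + 1e-8)

def select_layer(hidden_states, labels):
    """hidden_states has shape (num_layers, num_samples, dim).

    Returns the index of the highest-scoring layer and all per-layer scores.
    """
    scores = np.array([fisher_score(layer, labels) for layer in hidden_states])
    return int(np.argmax(scores)), scores

def make_synthetic_states(num_layers=6, num_samples=200, dim=16,
                          salient_layer=3, seed=0):
    """Synthetic placeholder for MLLM hidden states (not real activations).

    Only `salient_layer` carries a class-dependent shift, mimicking an
    intermediate layer that separates anomalies better than the others.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=num_samples)
    states = rng.normal(size=(num_layers, num_samples, dim))
    states[salient_layer, labels == 1] += 2.0
    return states, labels

states, labels = make_synthetic_states()
best_layer, scores = select_layer(states, labels)
print("selected layer:", best_layer)
```

In a real setting, the per-layer features would presumably come from the hidden states of a frozen MLLM run on a small set of labeled clips, and the selected layer's features would then feed the lightweight anomaly scorer.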