Abstract
Recent advances in vision-language models (VLMs) have enabled impressive generalization across diverse video understanding tasks under zero-shot settings. However, their capabilities in high-stakes industrial domains, where recognizing both routine operations and safety-critical anomalies is essential, remain largely underexplored. To address this gap, we introduce iSafetyBench, a new video-language benchmark specifically designed to evaluate model performance in industrial environments across both normal and hazardous scenarios. iSafetyBench comprises 1,100 video clips sourced from real-world industrial settings, annotated with open-vocabulary, multi-label action tags spanning 98 routine and 67 hazardous action categories. Each clip is paired with multiple-choice questions for both single-label and multi-label evaluation, enabling fine-grained assessment of VLMs in both standard and safety-critical contexts. We evaluate eight state-of-the-art video-language models under zero-shot conditions. Despite their strong performance on existing video benchmarks, these models struggle with iSafetyBench, particularly in recognizing hazardous activities and in multi-label scenarios. Our results reveal significant performance gaps, underscoring the need for more robust, safety-aware multimodal models for industrial applications. iSafetyBench provides a first-of-its-kind testbed to drive progress in this direction. The dataset is available at: https://github.com/raiyaan-abdullah/iSafety-Bench.