MVTamperBench: Evaluating Robustness of Vision-Language Models

Abstract

Recent advancements in Vision-Language Models (VLMs) have enabled significant progress in complex video understanding tasks. However, their robustness to real-world manipulations remains underexplored, limiting their reliability in critical applications. To address this gap, we introduce MVTamperBench, a comprehensive benchmark designed to evaluate VLMs' resilience to video tampering effects, including rotation, dropping, masking, substitution, and repetition. By systematically assessing state-of-the-art models, MVTamperBench reveals substantial variability in robustness, with models like InternVL2-8B achieving high performance, while others, such as Llama-VILA1.5-8B, exhibit severe vulnerabilities. To foster broader adoption and reproducibility, MVTamperBench is integrated into VLMEvalKit, a modular evaluation toolkit, enabling streamlined testing and facilitating advancements in model robustness. Our benchmark represents a critical step towards developing tamper-resilient VLMs, ensuring their dependability in real-world scenarios. Project Page: https://amitbcp.github.io/MVTamperBench/
