Abstract
Realistic human surveillance datasets are crucial for training and evaluating computer vision models under real-world conditions, facilitating the development of robust algorithms for detecting humans and the objects they interact with in complex environments. These datasets need to offer diverse and challenging data to enable a comprehensive assessment of model performance and the creation of more reliable surveillance systems for public safety. To this end, we present two visual object detection benchmarks, OD-VIRAT Large and OD-VIRAT Tiny, aimed at advancing visual understanding tasks in surveillance imagery. The video sequences in both benchmarks cover 10 different human surveillance scenes recorded from a significant height and distance. The proposed benchmarks offer rich bounding box and category annotations: OD-VIRAT Large has 8.7 million annotated instances in 599,996 images, and OD-VIRAT Tiny has 288,901 annotated instances in 19,860 images. This work also benchmarks state-of-the-art object detection architectures, including RTMDet, YOLOX, RetinaNet, DETR, and Deformable-DETR, on this object detection-specific variant of the VIRAT dataset. To the best of our knowledge, this is the first work to examine the performance of these recently published state-of-the-art object detection architectures on realistic surveillance imagery under challenging conditions such as complex backgrounds, occluded objects, and small-scale objects. The proposed benchmarking and experimental settings provide insights into the performance of the selected object detection models and lay the groundwork for developing more efficient and robust object detection architectures.