Abstract
In this paper, we address the problem of temporal action localization for a large-scale dataset of untrimmed cricket videos. Our action of interest is a cricket stroke played by a batsman, which is usually covered by cameras placed in the stands of the cricket ground at both ends of the cricket pitch. After applying a sequence of preprocessing steps, we have ~73 million frames for 1110 videos in the dataset at constant frame rate and resolution. The localization method is a generalized one that applies a trained random forest model for detecting CUTs (using summed grayscale histogram difference features) and two linear SVM camera models (CAM1 and CAM2) for first frame detection, trained on HOG features of CAM1 and CAM2 video shots. CAM1 and CAM2 are assumed to be part of the cricket stroke. At the predicted boundary positions, the HOG features of the first frames are computed, and a simple algorithm is used to combine the positively predicted camera shots. To keep the process as generic as possible, we did not use any domain-specific knowledge, such as tracking or specific shape and motion features. A detailed analysis of our methodology is provided, along with the metrics used to evaluate the individual models and the final predicted segments. We achieved a weighted mean TIoU of 0.5097 over a small sample of the test set.
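As an illustration of the evaluation metric named above, the following is a minimal sketch of temporal intersection over union (TIoU) between a predicted and a ground-truth segment, together with a weighted mean over several scores. The function names, the inclusive `(start, end)` frame-index convention, and the choice of weights are assumptions made for this example; the abstract does not specify how the weights are defined.

```python
def temporal_iou(pred, truth):
    """TIoU of two segments given as inclusive (start_frame, end_frame) pairs.

    The inclusive-interval convention is an assumption of this sketch.
    """
    (ps, pe), (ts, te) = pred, truth
    # Overlap length in frames; zero when the segments are disjoint.
    inter = max(0, min(pe, te) - max(ps, ts) + 1)
    union = (pe - ps + 1) + (te - ts + 1) - inter
    return inter / union if union > 0 else 0.0


def weighted_mean_tiou(scores, weights):
    """Weighted mean of TIoU scores.

    The weights are hypothetical (for example, one weight per video);
    the source does not state which weighting was used.
    """
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(s * w for s, w in zip(scores, weights)) / total


if __name__ == "__main__":
    # Two 10-frame segments that share 5 frames.
    print(temporal_iou((10, 19), (15, 24)))
    # Two scores combined with weights 1 and 3.
    print(weighted_mean_tiou([0.5, 1.0], [1, 3]))
```

With these definitions, a prediction that matches the ground truth exactly scores 1.0 and a prediction with no overlap scores 0.0, so a mean near 0.5 indicates that predicted and true segments overlap on roughly half of their combined extent on average.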