Abstract
Isolated Sign Language Recognition (ISLR) focuses on identifying individual sign language glosses. Considering the diversity of sign languages across geographical regions, developing region-specific ISLR datasets is crucial for supporting communication and research. Auslan, the sign language specific to Australia, still lacks a dedicated large-scale word-level dataset for the ISLR task. To fill this gap, we curate \underline{\textbf{the first}} large-scale Multi-view Multi-modal Word-Level Australian Sign Language recognition dataset, dubbed MM-WLAuslan. Compared to other publicly available datasets, MM-WLAuslan exhibits three significant advantages: (1) the largest amount of data, (2) the most extensive vocabulary, and (3) the most diverse multi-modal camera views. Specifically, we record 282K+ sign videos covering 3,215 commonly used Auslan glosses presented by 73 signers in a studio environment. Moreover, our filming system includes two different types of cameras, i.e., three Kinect-V2 cameras and a RealSense camera. We position the cameras hemispherically around the front half of the model and record videos simultaneously using all four cameras. Furthermore, we benchmark state-of-the-art methods on MM-WLAuslan under various multi-modal ISLR settings, including multi-view, cross-camera, and cross-view. Experimental results indicate that MM-WLAuslan is a challenging ISLR dataset, and we hope this dataset will contribute to the development of Auslan and the advancement of sign languages worldwide. All datasets and benchmarks are available at MM-WLAuslan.