Towards Representative Subset Selection for Self-Supervised Speech Recognition

Abstract

Self-supervised speech recognition models require considerable labeled training data for learning high-fidelity representations for Automatic Speech Recognition (ASR), which is computationally demanding and time-consuming, thereby hindering the usage of these models in resource-constrained environments. We consider the task of identifying an optimal subset of data to train self-supervised speech models for ASR. We make the surprising observation that the dataset pruning strategies used in vision tasks for sampling the most informative examples do not perform better than random subset selection on the task of fine-tuning self-supervised ASR. We then present the COWERAGE algorithm for better subset selection in self-supervised ASR, which is based on our finding that ensuring the coverage of examples based on training Word Error Rate (WER) in the early training epochs leads to better generalization performance. Extensive experiments on the wav2vec 2.0 model and the TIMIT, Librispeech, and LJSpeech datasets show the effectiveness of COWERAGE, with up to 17% absolute WER improvement over existing dataset pruning methods and random sampling. We also demonstrate that the coverage of training instances in terms of WER ensures inclusion of phonemically diverse examples, which leads to better test accuracy in self-supervised speech recognition models.
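
The idea of selecting a subset that covers the range of early-epoch training WER can be illustrated with a minimal sketch. The abstract does not specify the selection procedure, so everything below is an assumption made for illustration: the function name `wer_coverage_subset`, the use of equal-sized buckets over the WER ranking, and uniform sampling within each bucket are hypothetical choices, not the authors' published algorithm. The input `wer_by_id` is assumed to map each training utterance id to its training WER measured in an early epoch.

```python
import random


def wer_coverage_subset(wer_by_id, fraction, num_buckets=10, seed=0):
    """Pick a subset whose examples span the whole range of training WER.

    Hypothetical sketch: bucketing and per-bucket uniform sampling are
    assumptions, not details taken from the paper.
    """
    rng = random.Random(seed)
    # Rank utterance ids from lowest to highest early-epoch training WER.
    ranked = sorted(wer_by_id, key=wer_by_id.get)
    n = len(ranked)
    selected = []
    for b in range(num_buckets):
        # Equal-sized contiguous slices of the ranking act as WER buckets.
        bucket = ranked[b * n // num_buckets:(b + 1) * n // num_buckets]
        # Keep the same fraction of every bucket so no WER range is dropped.
        k = round(fraction * len(bucket))
        selected.extend(rng.sample(bucket, k))
    return selected
```

By contrast, a pruning strategy that keeps only the highest-scoring examples would draw from a single end of the ranking, whereas this sketch takes the same share from every WER range.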
