Abstract
For languages with insufficient resources to train speech recognition systems, query-by-example spoken term detection (QbE-STD) offers a way of accessing an untranscribed speech corpus by helping identify regions where spoken query terms occur. Yet retrieval performance can be poor when the query and corpus are spoken by different speakers and produced in different recording conditions. Using data selected from a variety of speakers and recording conditions from 7 Australian Aboriginal languages and a regional variety of Dutch, all of which are endangered or vulnerable, we evaluated whether QbE-STD performance on these languages could be improved by leveraging representations extracted from the pre-trained English wav2vec 2.0 model. Compared to the use of Mel-frequency cepstral coefficients and bottleneck features, we find that representations from the middle layers of the wav2vec 2.0 Transformer offer large gains in task performance (between 56% and 86%). While features extracted using the pre-trained English model yielded improved detection on all the evaluation languages, better detection performance was associated with the evaluation language's phonological similarity to English.