SANTLR: Speech Annotation Toolkit for Low Resource Languages

Abstract

While low resource speech recognition has attracted considerable attention from the speech community, there are few tools available to facilitate low resource speech collection. In this work, we present SANTLR: Speech Annotation Toolkit for Low Resource Languages. It is a web-based toolkit which allows researchers to easily collect and annotate a corpus of speech in a low resource language. Annotators may use this toolkit for two purposes: transcription or recording. In transcription, annotators transcribe audio files provided by the researchers; in recording, annotators record their voice by reading provided texts. We highlight two properties of this toolkit. First, SANTLR has a user-friendly user interface (UI). Both researchers and annotators interact through this simple web interface. Annotators do not need any expertise in audio or text processing; the toolkit handles all preprocessing and postprocessing steps. Second, we employ a multi-step ranking mechanism to facilitate the annotation process. In particular, the toolkit gives higher priority to utterances which are easier to annotate and are more beneficial to achieving the goal of the annotation, e.g. quickly training an acoustic model.
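
The prioritization idea can be illustrated with a minimal sketch. The abstract does not specify how ease or benefit are measured or how the ranking steps are combined, so everything here is an assumption made for illustration: the Utterance fields, the min_ease threshold, and the two-step ordering are hypothetical and are not SANTLR's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    utt_id: str
    ease: float     # hypothetical score in [0, 1]; higher means easier to annotate
    benefit: float  # hypothetical score in [0, 1]; higher means more useful for the goal


def rank_utterances(utterances: List[Utterance], min_ease: float = 0.5) -> List[Utterance]:
    """Order utterances for annotation in two steps (illustrative only)."""
    # Step 1: split by ease so that easy utterances are always served first.
    easy = [u for u in utterances if u.ease >= min_ease]
    hard = [u for u in utterances if u.ease < min_ease]

    # Step 2: within each group, sort by benefit (descending), then ease (descending).
    def key(u: Utterance):
        return (-u.benefit, -u.ease)

    return sorted(easy, key=key) + sorted(hard, key=key)


if __name__ == "__main__":
    queue = rank_utterances([
        Utterance("a", ease=0.9, benefit=0.2),
        Utterance("b", ease=0.3, benefit=0.9),
        Utterance("c", ease=0.7, benefit=0.8),
    ])
    print([u.utt_id for u in queue])
```

In this sketch the ease threshold acts as a coarse first pass and the benefit score as a finer second pass; a real system might instead combine the scores or re-rank as new annotations arrive.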
