REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR

Abstract

Unsupervised automatic speech recognition (ASR) aims to learn the mapping between the speech signal and its corresponding textual transcription without the supervision of paired speech-text data. A word/phoneme in the speech signal is represented by a segment of speech signal with variable length and unknown boundary, and this segmental structure makes learning the mapping between speech and text challenging, especially without paired data. In this paper, we propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR. REBORN alternates between (1) training a segmentation model that predicts the boundaries of the segmental structures in speech signals and (2) training the phoneme prediction model, whose input is the speech feature segmented by the segmentation model, to predict a phoneme transcription. Since supervised data for training the segmentation model is not available, we use reinforcement learning to train the segmentation model to favor segmentations that yield phoneme sequence predictions with a lower perplexity. We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech. We comprehensively analyze why the boundaries learned by REBORN improve the unsupervised ASR performance.
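
To make the perplexity-based training signal concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy bigram phoneme language model (`bigram_prob`), frame-level phoneme labels, and majority-vote merging within each segment. All function names and values are hypothetical, and using negative perplexity as the entire reward is a simplification made here for illustration.

```python
import math
from collections import Counter


def segment_phonemes(frame_labels, boundaries):
    """Collapse frame-level labels into one phoneme per segment.

    boundaries[i] == 1 marks frame i as the start of a new segment.
    Each segment is reduced to its most frequent label (an assumption
    of this sketch, standing in for a learned phoneme predictor).
    """
    segments, current = [], []
    for label, is_start in zip(frame_labels, boundaries):
        if is_start and current:
            segments.append(current)
            current = []
        current.append(label)
    if current:
        segments.append(current)
    return [Counter(seg).most_common(1)[0][0] for seg in segments]


def perplexity(phonemes, bigram_prob, floor=1e-3):
    """Perplexity under a toy bigram LM; unseen bigrams get `floor`."""
    if len(phonemes) < 2:
        return float("inf")
    log_sum = sum(
        math.log(bigram_prob.get(pair, floor))
        for pair in zip(phonemes, phonemes[1:])
    )
    return math.exp(-log_sum / (len(phonemes) - 1))


def reward(frame_labels, boundaries, bigram_prob):
    """Higher reward for segmentations whose phoneme sequence has lower perplexity."""
    return -perplexity(segment_phonemes(frame_labels, boundaries), bigram_prob)


# Hypothetical toy data: six frames of the word "cat".
frame_labels = ["k", "k", "ae", "ae", "ae", "t"]
bigram_prob = {("k", "ae"): 0.5, ("ae", "t"): 0.5}

good = [1, 0, 1, 0, 0, 1]  # three segments: k | ae | t
over = [1, 1, 1, 1, 1, 1]  # every frame is its own segment

print(reward(frame_labels, good, bigram_prob))
print(reward(frame_labels, over, bigram_prob))
```

In this toy setting the over-segmented candidate produces repeated phonemes that the language model finds unlikely, so it receives a lower reward than the three-segment candidate. A reinforcement learning update would then shift the segmentation model toward boundaries like the latter.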
