Segmental Audio Word2Vec: Representing Utterances as Sequences of Vectors with Applications in Spoken Term Detection

Abstract

While Word2Vec represents words (in text) as vectors carrying semantic information, audio Word2Vec was shown to be able to represent signal segments of spoken words as vectors carrying phonetic structure information. Audio Word2Vec can be trained in an unsupervised way from an unlabeled corpus, except that the word boundaries are needed. In this paper, we extend audio Word2Vec from word-level to utterance-level by proposing a new segmental audio Word2Vec, in which unsupervised spoken word boundary segmentation and audio Word2Vec are jointly learned and mutually enhanced, so an utterance can be directly represented as a sequence of vectors carrying phonetic structure information. This is achieved by a segmental sequence-to-sequence autoencoder (SSAE), in which a segmentation gate trained with reinforcement learning is inserted in the encoder. Experiments on English, Czech, French and German show very good performance in both unsupervised spoken word segmentation and spoken term detection applications (significantly better than frame-based DTW).
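
To make the idea of "an utterance as a sequence of vectors" concrete, the following is a minimal sketch, not the SSAE itself. It assumes the segmentation gate has already produced a binary decision per acoustic frame (1 meaning a segment boundary falls after that frame), and it substitutes simple mean pooling for the learned sequence-to-sequence encoder. The function names `split_by_gate`, `mean_pool`, and `utterance_to_vectors` are hypothetical and used for illustration only.

```python
from typing import List, Sequence

Frame = Sequence[float]


def split_by_gate(frames: Sequence[Frame], gate: Sequence[int]) -> List[List[Frame]]:
    """Split a frame sequence into segments.

    Assumption: gate[t] == 1 means a segment boundary falls right after frame t.
    """
    if len(frames) != len(gate):
        raise ValueError("frames and gate must have the same length")
    segments: List[List[Frame]] = []
    current: List[Frame] = []
    for frame, g in zip(frames, gate):
        current.append(frame)
        if g == 1:
            segments.append(current)
            current = []
    if current:  # trailing frames after the last boundary form a final segment
        segments.append(current)
    return segments


def mean_pool(segment: Sequence[Frame]) -> List[float]:
    """Stand-in for the learned encoder: average the frames of one segment."""
    dim = len(segment[0])
    return [sum(f[d] for f in segment) / len(segment) for d in range(dim)]


def utterance_to_vectors(frames: Sequence[Frame], gate: Sequence[int]) -> List[List[float]]:
    """Represent an utterance as one fixed-size vector per segment."""
    return [mean_pool(seg) for seg in split_by_gate(frames, gate)]


# Toy utterance: three 2-dimensional frames, one boundary after the second frame.
frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
gate = [0, 1, 0]
vectors = utterance_to_vectors(frames, gate)
```

In the paper's model the gate and the encoder are learned jointly; here they are decoupled only so that the data flow from frames to per-segment vectors is easy to follow.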
