Analyzing autoencoder-based acoustic word embeddings

Abstract

Recent studies have introduced methods for learning acoustic word embeddings (AWEs), fixed-size vector representations of words which encode their acoustic features. Despite the widespread use of AWEs in speech processing research, they have only been evaluated quantitatively in their ability to discriminate between whole word tokens. To better understand the applications of AWEs in various downstream tasks and in cognitive modeling, we need to analyze the representation spaces of AWEs. Here we analyze basic properties of AWE spaces learned by a sequence-to-sequence encoder-decoder model in six typologically diverse languages. We first show that these AWEs preserve some information about words' absolute duration and speaker. At the same time, the representation space of these AWEs is organized such that the distance between words' embeddings increases with those words' phonetic dissimilarity. Finally, the AWEs exhibit a word onset bias, similar to patterns reported in various studies on human speech processing and lexical access. We argue this is a promising result and encourage further evaluation of AWEs as a potentially useful tool in cognitive science, which could provide a link between speech processing and lexical memory.
