Abstract
In the domain of unsupervised learning, most work on speech has focused on discovering low-level constructs such as phoneme inventories or word-like units. In contrast, for written language there is a large body of work on unsupervised induction of semantic representations of words, whole sentences and longer texts. In this study we examine the challenges of adapting these approaches from written to spoken language. We conjecture that unsupervised learning of the semantics of spoken language becomes feasible if we abstract from the surface variability. We simulate this setting with a dataset of utterances spoken by a realistic but uniform synthetic voice. We evaluate two simple unsupervised models which, with varying degrees of success, learn semantic representations of speech fragments. Finally, we present inconclusive results on human speech, and discuss the challenges inherent in learning distributional semantic representations from unrestricted natural spoken language.