Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models

Abstract

Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.
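
To make the notion of "largely orthogonal" subspaces concrete, the following is a minimal sketch of one way to quantify overlap between two subspaces. It is an illustration only, not necessarily the geometric analysis used in the paper. The function names (`orthonormal_basis`, `subspace_overlap`) are hypothetical, and the input vectors are assumed to be directions that span, say, a phone subspace and a speaker subspace; how such directions are obtained is not specified in the abstract. The score is the mean squared cosine of the principal angles between the two subspaces: 0 when they are orthogonal, 1 when one is contained in the other.

```python
import math


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        # Remove the components of w that lie along the basis found so far.
        for b in basis:
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:  # skip vectors that are (numerically) linearly dependent
            basis.append([wi / norm for wi in w])
    return basis


def subspace_overlap(vectors_a, vectors_b):
    """Mean squared cosine of the principal angles between two subspaces.

    Computed as ||Qa^T Qb||_F^2 / min(dim A, dim B), where Qa and Qb hold
    orthonormal bases. Returns a value in [0, 1].
    """
    qa = orthonormal_basis(vectors_a)
    qb = orthonormal_basis(vectors_b)
    if not qa or not qb:
        raise ValueError("each input must span at least one dimension")
    total = sum(
        sum(x * y for x, y in zip(a, b)) ** 2
        for a in qa
        for b in qb
    )
    return total / min(len(qa), len(qb))


# Toy 4-dimensional example (made-up directions, not real model features).
phone_dirs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
speaker_dirs = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
overlap = subspace_overlap(phone_dirs, speaker_dirs)
```

In this toy example the two sets of directions share no axis, so the overlap score is zero. With directions estimated from real layer activations, a score near zero would be consistent with the kind of orthogonality the abstract describes.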
