Abstract
Accent classification (AC) is the task of predicting the accent type of an input utterance, and it can serve as a preliminary step toward accented speech recognition and accent conversion. Existing studies have often performed this classification by training a neural network to minimize the classification error of the predicted accent label, which is obtained as the model output. Because this approach optimizes the entire model only with respect to the classification loss during training, the model might learn to predict the accent type from irrelevant features, such as individual speaker identity, which are not informative at test time. To address this problem, we propose GE2E-AC, in which we train a model to extract an accent embedding (AE) of an input utterance such that the AEs of the same accent class get closer, instead of directly minimizing the classification loss. We experimentally show the effectiveness of the proposed GE2E-AC compared to a baseline model trained with the conventional cross-entropy-based loss.
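
To make the idea of pulling same-accent AEs together concrete, the following is a minimal sketch of a GE2E-style softmax loss computed over a batch of embeddings grouped by accent class. It follows the generalized end-to-end loss known from speaker verification, with accent classes in place of speakers; whether GE2E-AC uses exactly this form is an assumption, since the abstract does not give the formulation. The function name `ge2e_softmax_loss` is hypothetical, and the scale `w` and bias `b` are fixed constants here, whereas in GE2E they are normally learned. Each class is assumed to have at least two utterances in the batch.

```python
import numpy as np


def ge2e_softmax_loss(emb, w=10.0, b=-5.0):
    """GE2E-style softmax loss (illustrative sketch, not the paper's code).

    emb: array of shape (n_classes, n_utts, dim), embeddings grouped by
         accent class; n_utts must be at least 2.
    w, b: similarity scale and bias (assumed fixed here).
    Returns the mean loss over all embeddings in the batch.
    """
    n_cls, n_utt, _ = emb.shape

    def unit(x):
        # L2-normalize along the last axis.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    emb = unit(emb)
    # Per-class centroids, shape (n_cls, dim).
    centroids = unit(emb.mean(axis=1))
    # Leave-one-out centroids: the own-class centroid excludes the
    # embedding being scored, shape (n_cls, n_utt, dim).
    loo = unit((emb.sum(axis=1, keepdims=True) - emb) / (n_utt - 1))

    # Cosine similarity of every embedding to every class centroid.
    sim = np.einsum("jid,kd->jik", emb, centroids)
    own = np.einsum("jid,jid->ji", emb, loo)
    idx = np.arange(n_cls)
    sim[idx, :, idx] = own  # use leave-one-out value for the true class

    logits = w * sim + b
    # Numerically stable log-sum-exp over classes.
    m = logits.max(axis=-1, keepdims=True)
    lse = m[..., 0] + np.log(np.exp(logits - m).sum(axis=-1))
    target = logits[idx, :, idx]  # logit of the true class, (n_cls, n_utt)
    return float((lse - target).mean())
```

Under this sketch, a batch whose embeddings cluster tightly by accent class yields a loss near zero, while a batch in which each class mixes embeddings from different clusters yields a much larger one. Unlike a cross-entropy loss on a classifier head, the class scores here come only from distances between embeddings and class centroids.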