Non-native Speaker Verification for Spoken Language Assessment
Automatic spoken language assessment systems are becoming increasingly popular as a way to handle growing interest in second language learning. One challenge for these systems is to detect malpractice. Malpractice can take a range of forms; this paper focuses on detecting when a candidate attempts to impersonate another in a speaking test. This form of malpractice is closely related to speaker verification, but applied in the specific domain of spoken language assessment. Advanced speaker verification systems, which leverage deep-learning approaches to extract speaker representations, have been successfully applied to a range of native speaker verification tasks. These systems are explored for non-native spoken English data in this paper. The data used for speaker enrolment and verification is mainly taken from the BULATS test, which assesses English language skills for business. The performance of systems trained on relatively limited amounts of BULATS data is compared with that of systems trained on standard large speaker verification corpora. Experimental results on large-scale test sets with millions of trials show that the best performance is achieved by adapting the imported model to non-native data. A breakdown of impostor trials across different first languages (L1s) and grades is analysed, which shows that inter-L1 impostors are more challenging for speaker verification systems.
Linlin Wang, Yu Wang, Mark J. F. Gales
ALTA Institute / Engineering Department, Cambridge University, UK
Acknowledgements: This paper reports on research supported by Cambridge Assessment, University of Cambridge. Thanks to Cambridge English Language Assessment for support and access to the BULATS and Linguaskill data. Both authors contributed equally. The authors would also like to thank Dr Kate Knill and Dr Anton Ragni for valuable discussions during the preparation of this manuscript.
Index Terms— speaker verification, non-native speech
1 Introduction
Automatic spoken assessment systems are becoming increasingly popular, especially for English, given the high worldwide demand for learning English as a second language [Zechner2009, Witt2000, Metallinou2014, Wang2018a]. In addition to assessing aspects of a candidate’s English ability, such as fluency and pronunciation, and giving feedback to the candidate, these automatic systems also need to ensure the integrity of the candidate’s score by detecting malpractice, as shown in Figure 1. Malpractice is any action by a candidate that breaks the assessment regulations and potentially threatens the reliability of the exam and the associated certification. Malpractice can take a range of forms in spoken language assessment scenarios, such as using or trying to use unauthorised materials, impersonation, giving responses that are irrelevant to the prompts or questions, or speaking in the candidate’s first language (L1) instead of the target language. This work investigates the problem of automatically detecting impersonation, in which a candidate attempts to impersonate another in a speaking test. This is closely related to speaker verification.
Speaker verification is the process of accepting or rejecting an identity claim by comparing the speaker-specific information extracted from the verification speech with that from the enrolment speech of the claimed identity. These approaches can be directly applied to detect impersonation in spoken language tests. The performance of speaker verification systems has advanced considerably in the last decade with the development of i-vector modelling [Dehak2011], in which a speech segment or a speaker is represented as a low-dimensional feature vector. Extraction of i-vectors is normally based on a Gaussian mixture model (GMM) based universal background model (UBM). This fixed-length representation can then be used with a probabilistic linear discriminant analysis (PLDA) model to produce verification scores by comparing speaker representations, which are then used to make valid or impostor speaker decisions [Prince2007, Kenny2010, Garcia2011, Garcia2012]. Recently, with developments in deep learning, the performance of speaker verification systems has been improved by replacing the GMM with a deep neural network (DNN) to derive statistics for extracting speaker representations. This DNN is usually trained to take a fixed-length window of the acoustics and discriminate between speakers using supplied speaker labels as targets. To handle the variable-length nature of the acoustic signal, a pooling layer is used to yield the final fixed-dimensional speaker representation. In [Variani2014], a DNN was trained at the frame level, and pooling was performed by averaging activation vectors of the last hidden layer over all frames of an input utterance, giving the representation known as a d-vector. In [Snyder2016, Snyder2017, Snyder2018], segment-level embeddings were extracted, which, when combined with data augmentation, are referred to as x-vectors [Snyder2018]. By leveraging data augmentation based on background noise and acoustic reverberation, x-vector based systems can achieve better performance than i-vector and d-vector based systems on standard speaker verification tasks.
There has been some previous work on tasks related to non-native speech data using speaker verification approaches, such as detection of non-native speech [Shriberg2008], classification of native/non-native English [Tan2010] and L1 detection [Omar2010]. In [Qian2016], meta-data (L1) sensitive bottleneck features were employed within the i-vector framework to improve the performance of speaker verification with non-native speech. In contrast, this paper focuses on making use of state-of-the-art deep-learning based speaker verification approaches to detect candidate impersonation in an English speaking test. As there are limited amounts of data available for the non-native learner task, it is of interest to investigate adapting a system built for a standard speaker verification task to this non-native task. Here a system based on the VoxCeleb dataset [Nagrani2017, Chung2018] is adapted to the BULATS task. Two forms of adaptation are examined: modifying the PLDA distance measure, and adapting the process for extracting the speaker representation by “fine-tuning” the network to the target domain. Furthermore, performance is analysed in detail with respect to speaker attributes. Gender is an important attribute in impostor selection for standard speaker verification tasks. For non-native speech, there are two additional speaker attributes that should also be taken into consideration: the L1 and the language proficiency level (referred to as “grade” in this work).
This paper is organised as follows. Section 2 gives an overview of speaker verification systems, and Section 3 introduces the non-native spoken English corpora used in this work. Experimental setup is described in Section 4, results and analysis are detailed in Section 5, and finally, conclusions are drawn in Section 6.
2 Speaker Verification Systems
In this work, both i-vector and x-vector representations are used. For the i-vector speaker representation, the form described in [Dehak2011, Povey2011] is used. This section discusses only the x-vector speaker representation, as this is the form that is adapted to the non-native verification task.
2.1 Deep neural network embedding extractor
The DNN for extracting the utterance-level speaker representation, or embedding, consists of three blocks. The first block of the deep embedding extractor is a frame-level feature extractor. The input to this block is a sequence of acoustic feature vectors, one per frame. This part normally consists of a number of hidden layers such as long short-term memory (LSTM) [Heigold2016] or time delay neural network (TDNN) layers [Snyder2017, Snyder2018]. The activations of the last hidden layer of this block for the $T$ input frames, $\mathbf{h}_1, \ldots, \mathbf{h}_T$, form the input to the second block, which is a statistics pooling layer. This layer converts variable-length frame-level features into a fixed-dimensional vector by calculating the mean vector and the standard deviation vector of the frame-level feature vectors over the $T$ frames. The third block takes the statistics as the input and produces utterance-level representations using a number of stacked fully-connected hidden layers. The output of the DNN extractor is a softmax layer, and each of the nodes corresponds to one speaker identity. This DNN extractor is trained with a cross-entropy loss function, using the supplied speaker labels as the targets. Given $N$ training segments and $K$ speakers, the cross-entropy can be written as
$$E(\boldsymbol{\theta}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \delta(s_n, k) \, \log P\left(k \mid \mathbf{x}^{(n)}; \boldsymbol{\theta}\right)$$
where $\boldsymbol{\theta}$ represents the parameters of the DNN, $P(k \mid \mathbf{x}^{(n)}; \boldsymbol{\theta})$ is the softmax output for speaker $k$ given the frames $\mathbf{x}^{(n)}$ of segment $n$, and $\delta(\cdot, \cdot)$ represents the Kronecker delta function, which equals 1 when $s_n = k$ and 0 otherwise. $s_n = k$ represents that the speaker label for segment $n$ is $k$. After the DNN is trained, the utterance-level embeddings, $\mathbf{e}$, are normally extracted from the output of the affine component of one hidden layer in the third block of the DNN, with or without the nonlinear activation function applied [Snyder2017, Snyder2018].
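The pooling and training criterion described above can be illustrated with a minimal sketch in plain Python. It is not the extractor used in the paper: the function names are hypothetical, the frame-level activations are passed in as plain lists, the standard deviation uses population normalisation (an assumption, since the text does not specify it), and the softmax posteriors are assumed to be already computed.

```python
import math


def statistics_pooling(frames):
    """Map T frame-level vectors (lists of floats) to one fixed-size vector.

    The result is the per-dimension mean followed by the per-dimension
    standard deviation, so its length is 2 * dim regardless of T.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # Population standard deviation (divide by T); an assumption of this sketch.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


def cross_entropy(posteriors, labels):
    """Cross-entropy over segments.

    posteriors[n][k] is the softmax output for speaker k on segment n and
    labels[n] is the index of the true speaker of segment n. Only the term
    for the true speaker survives the Kronecker delta.
    """
    return -sum(math.log(p[s]) for p, s in zip(posteriors, labels))
```

For example, `statistics_pooling([[1.0, 2.0], [3.0, 2.0]])` and a call with ten frames of the same dimension both return a four-dimensional vector, which is what allows the later fully-connected layers to operate on utterances of any length.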
2.2 PLDA classifier and adaptation
After the speaker embeddings are extracted, they are used to train a PLDA model that yields the score (distance) between speaker embeddings. The training of the PLDA model aims to maximise the between-speaker difference and minimise the within-speaker variation, typically using expectation maximisation (EM). A number of variants of PLDA models have been introduced into the speaker verification task based on this “standard” PLDA [Prince2007]: two-covariance PLDA [Brummer2010] and heavy-tailed PLDA [Kenny2010]. The variant implemented in the Kaldi toolkit [Povey2011], and used in this work, follows [Ioffe2006] and is similar to the two-covariance model. This model can be written as
$$\mathbf{e} = \mathbf{y} + \boldsymbol{\epsilon}, \qquad \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}_b), \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_w)$$
where $\mathbf{e}$ is the speaker embedding. The vector $\mathbf{y}$ represents the underlying speaker vector and $\boldsymbol{\mu}$ represents its mean. $\boldsymbol{\epsilon}$ is the Gaussian noise vector. For speaker verification tasks, estimation of this PLDA model can be performed by estimating the between-speaker covariance matrix, $\boldsymbol{\Sigma}_b$, and within-speaker covariance matrix, $\boldsymbol{\Sigma}_w$, using the EM algorithm.
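To make the scoring step concrete, the following sketch computes a verification score as a log-likelihood ratio under a two-covariance model. It is a simplified illustration rather than the Kaldi implementation: it assumes a single enrolment embedding and a single test embedding, and it assumes that the between-speaker and within-speaker covariances are diagonal, so each dimension can be scored independently. The function and argument names are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def plda_llr(enrol, test, mu, between_var, within_var):
    """Log-likelihood ratio: same-speaker versus different-speaker hypothesis.

    enrol, test: embeddings as lists of floats.
    mu, between_var, within_var: per-dimension model parameters
    (diagonal covariances are an assumption of this sketch).
    """
    llr = 0.0
    for e, t, m, b, w in zip(enrol, test, mu, between_var, within_var):
        e0, t0 = e - m, t - m
        total = b + w  # marginal variance of one embedding dimension
        # Same speaker: the test value is predicted from the enrolment value.
        cond_mean = b / total * e0
        cond_var = total - b * b / total
        # Different speakers: the test value is independent of the enrolment.
        llr += log_gauss(t0, cond_mean, cond_var) - log_gauss(t0, 0.0, total)
    return llr
```

A positive score favours the hypothesis that both embeddings come from the same speaker; comparing the score against a threshold gives the accept or reject decision.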
PLDA is a powerful approach to classifying speakers given large amounts of training data with speaker labels [Garcia2014, Garcia2014a, Villalba2014]. However, large amounts of labelled training data may not be available in the domain of interest, as is the case for the non-native speaker verification task considered in this paper. One approach to alleviate this problem is to adapt a pre-trained out-of-domain model to the target domain. There are a number of methods for adapting the PLDA model in both supervised and unsupervised manners [Garcia2014b, Villalba2014]. The Kaldi toolkit implements an unsupervised adaptation method which does not require knowledge of speaker labels [Povey2011]. This method aims at adapting the between-speaker and within-speaker covariance matrices of the out-of-domain PLDA model to better match the total covariance of the in-domain adaptation data.
3 Non-native Spoken English Corpora
The Business Language Testing Service (BULATS) test of Cambridge Assessment English [chambers-2011-bulats] is a multi-level computer-based English test. It consists of read speech and free-speaking components, with the candidate responding to prompts. The BULATS spoken test has five sections, all with materials appropriate to business scenarios. The first section (A) contains eight questions about the candidate and their work. The second section (B) is a read-aloud section in which the candidates are asked to read eight sentences. The last three sections (C, D and E) have longer utterances of spontaneous speech elicited by prompts. In section C, the candidates are asked to talk for one minute about a prompted business-related topic. In section D, the candidate has one minute to describe a business situation illustrated in graphs or charts, such as pie or bar charts. The prompt for section E asks the candidate to imagine they are in a specific conversation and to respond to questions they may be asked in that situation (e.g. advice about planning a conference). This section is made up of five 20-second responses.
Each section is scored between 0 and 6; the overall score is therefore between 0 and 30. This score is then mapped to the language proficiency levels of the Common European Framework of Reference (CEFR) [CEFR2001], an international standard for describing language ability on a six-level scale. Each candidate is finally assigned a “grade”, ranging from minimal (A1) and basic (A2) command, through limited but effective (B1) and generally effective (B2) command, to good operational (C1) and fully operational (C2) command of the spoken language.
In this work, non-native speech from the BULATS test is used as both training and test data for the speaker verification systems. To investigate how the systems generalise, data for testing is also taken from the Cambridge Assessment English Linguaskill online test (https://www.cambridgeenglish.org/exams-and-tests/linguaskill/). Like BULATS, this is a multi-level test with a similar format, composed of the same five sections described above, but it assesses general English ability.
4 Experimental Setup
A set of 8,480 candidates from BULATS was used for training. The approximately 280 hours of speech covers a wide range of more than 70 different L1s. There are 15 major L1s with more than 100 candidates for each, including Tamil, Gujarati, Hindi, Telugu, Malayalam, Bengali, Spanish, Russian, Kannada, Portuguese, French, etc. Data augmentation was applied to the training set, and each recording was processed with a randomly selected source from “babble”, “music”, “noise” and “reverb” [Snyder2018], which roughly doubled the size of the original training set. Another set of 8,318 BULATS candidates was used as one test set to evaluate the system performance. There are 7 major L1s in this set, each of which has more than 100 candidates: Spanish, Thai, Tamil, Arabic, Vietnamese, Polish and Dutch. There are no overlapping candidates between the BULATS training and test sets. The other test set of 2,540 candidates came from the Linguaskill test, of which there are 6 major L1s each with more than 100 candidates: Hindi, Portuguese, Japanese, Spanish, Thai and Vietnamese. Each of the training set and two test sets was fairly gender balanced, with approximately one third of candidates graded as B1, one third graded as B2, and the rest graded as A1, A2, C1, or C2, according to CEFR ability levels. For each test set candidate, responses from sections A and B were used for speaker enrolment (approximately 180s), while the more challenging free-speaking sections C, D, and E were used for whole section-level verification (approximately 60s for each section).
5 Experimental Results
5.1 Baseline system performance
Gender is generally considered an important speaker attribute, and impostor trials were first selected from the same gender group as the reference speaker, as commonly done in standard speaker verification tasks. This resulted in a total of 104.8 million verification trials for the BULATS test set and 9.7 million trials for the Linguaskill test set.
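The order of magnitude of these trial counts can be checked with a short calculation. The sketch below rests on assumptions that are not stated in the text: every candidate contributes three verification segments (sections C, D and E), each segment is scored against the enrolment of every candidate of the same gender (including the candidate's own, which gives the target trial), and the gender split is exactly even. The function name is hypothetical.

```python
def count_trials(group_sizes, segments_per_speaker=3):
    """Number of trials when each segment meets every same-group enrolment.

    group_sizes: number of candidates in each gender group.
    A group of n candidates yields segments_per_speaker * n * n trials.
    """
    return segments_per_speaker * sum(n * n for n in group_sizes)


# Exactly balanced split of the 8,318 BULATS test candidates (assumption).
bulats_estimate = count_trials([4159, 4159])
# Exactly balanced split of the 2,540 Linguaskill candidates (assumption).
linguaskill_estimate = count_trials([1270, 1270])
```

Under these assumptions the estimates are about 103.8 million and 9.7 million trials. An exactly even split minimises the count, so a test set that is only roughly gender balanced would give a somewhat larger number, which is consistent with the reported 104.8 million trials for BULATS.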
An i-vector/PLDA system and an x-vector/PLDA system were first trained on the “in-domain” BULATS training set. For the i-vector system, 13-dimensional perceptual linear predictive (PLP) features were extracted using the HTK toolkit [Young2015_htk] with a frame-length of 25ms. A UBM of 2,048 mixture components was first trained with full-covariance matrices, and then 600-dimensional i-vectors were extracted for both training and test sets. For the x-vector system, 40-dimensional filterbank features were also extracted using HTK with a frame-length of 25ms. DNN configurations were the same as used in [Snyder2018], and 512-dimensional x-vectors were extracted from the affine component of the segment-level layer immediately following the statistics pooling layer.
Performance of the two baseline systems is shown in Table 1 in terms of equal error rate (EER). The x-vector system yielded lower EERs on both BULATS and Linguaskill test sets.
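For reference, the EER is the operating point at which the miss rate on target trials equals the false alarm rate on impostor trials. The sketch below shows one simple way to estimate it from two lists of scores. It is an illustration only, not the scoring tool used for the reported results: the function name is hypothetical, a trial is assumed to be accepted when its score is at or above the threshold, and when the two error rates never coincide exactly the average of the closest pair is returned.

```python
import bisect


def equal_error_rate(target_scores, impostor_scores):
    """Return (eer, threshold) estimated from target and impostor scores."""
    tar = sorted(target_scores)
    imp = sorted(impostor_scores)
    # Candidate thresholds: every observed score, plus "reject everything".
    thresholds = sorted(set(tar) | set(imp)) + [float("inf")]
    best_gap, eer, best_threshold = float("inf"), None, None
    for threshold in thresholds:
        # Targets scoring below the threshold are wrongly rejected (misses).
        miss = bisect.bisect_left(tar, threshold) / len(tar)
        # Impostors scoring at or above the threshold are wrongly accepted.
        false_alarm = (len(imp) - bisect.bisect_left(imp, threshold)) / len(imp)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap = gap
            eer = (miss + false_alarm) / 2.0
            best_threshold = threshold
    return eer, best_threshold
```

The returned threshold is the one later referred to as the EER operating point, where impostor trials above it count as false alarm errors.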
In addition to the models trained on the BULATS data, it is also interesting to investigate how “out-of-the-box” models for standard speaker verification tasks perform on this non-native speaker verification task, as there are limited amounts of non-native learner English data publicly available. In this paper, the VoxCeleb x-vector/PLDA system released with Kaldi [Povey2011] was used as the imported model; it was trained on augmented VoxCeleb 1 [Nagrani2017] and VoxCeleb 2 [Chung2018]. There are more than 7,000 speakers in the VoxCeleb dataset with more than 2,000 hours of audio data, making it the largest publicly available speaker recognition dataset. Mel-frequency cepstral coefficients (MFCCs) of 30 dimensions were used as input features, and system configurations were otherwise the same as for the BULATS x-vector/PLDA system. Table 2 shows that, owing to domain mismatch, these out-of-domain models performed worse than the baseline systems trained on a far smaller amount of BULATS data. Thus, two kinds of in-domain adaptation strategies were explored to make use of the BULATS training set: PLDA adaptation and x-vector extractor fine-tuning. For PLDA adaptation, x-vectors of the BULATS training set were first extracted using the VoxCeleb-trained x-vector extractor, and then employed to adapt the VoxCeleb-trained PLDA model with their mean and variance. For x-vector extractor fine-tuning, the output layer was re-initialised with the number of targets adjusted to the BULATS training set while all other layers of the VoxCeleb-trained model were kept unchanged, and then all layers were fine-tuned on the BULATS training set. Here the PLDA adaptation system is referred to as X1 and the extractor fine-tuning system is referred to as X2. Both adaptation approaches yield good performance gains, as can be seen from Table 2. PLDA adaptation is a straightforward yet effective approach, while the system with x-vector extractor fine-tuning gave slightly lower EERs on both the BULATS and Linguaskill test sets by virtue of a relatively “in-domain” extractor prior to the PLDA back-end.
|System||BULATS EER (%)||Linguaskill EER (%)|
|+ PLDA adaptation (X1)||0.55||0.62|
|+ Extractor fine-tuning (X2)||0.49||0.55|
Detection error tradeoff (DET) curves of the four x-vector/PLDA systems on the BULATS test set are shown in Figure 2. Both adaptation systems outperformed the original VoxCeleb-trained system at every operating point of false alarm (FA) probability and miss (MS) probability. The extractor fine-tuning system gave a higher MS probability than the PLDA-adapted one only when the FA probability was below 0.4%; over the large range of FA probabilities above 0.4%, the extractor fine-tuning system outperformed the PLDA-adapted one.
Furthermore, by leveraging the large-scale VoxCeleb dataset, both adaptation systems produced lower EERs than the baseline systems trained solely on BULATS data. This was especially true of the extractor fine-tuning system, which gave a relative EER reduction of 26% over the baseline x-vector/PLDA system on the BULATS test set. It can also be seen from Figure 2 that the extractor fine-tuning system gave consistently better performance than the baseline systems at almost every operating point of FA and MS.
5.2 Impostor attributes analysis
As mentioned in Section 5.1, gender is an important attribute when selecting impostors. For the non-native English speech data considered in this work, there are two additional attributes that may significantly impact performance: the candidate’s speaking ability (grade) and L1. In this section, the impact of both attributes on verification performance is analysed on the BULATS test set using the extractor fine-tuning system (X2) detailed in Section 5.1, with impostors selected from the same gender group as the reference speaker. With the decision threshold set at the EER operating point, breakdowns by grade and by L1 are investigated in terms of the number of impostor trials resulting in false alarm (FA) errors.
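The breakdown described here can be sketched as a simple tally. The code below is illustrative only: the function name and the trial tuple layout are hypothetical, and it reports, for each reference group, the share of false alarms contributed by each impostor group, which is one plausible reading of the analysis rather than a confirmed description of how the tables were produced.

```python
from collections import Counter


def false_alarm_breakdown(impostor_trials, threshold):
    """Share of false alarms per (reference group, impostor group) pair.

    impostor_trials: iterable of (score, reference_group, impostor_group),
    where a group is a grade or an L1 label. An impostor trial is a false
    alarm when its score is at or above the threshold.
    """
    counts = Counter()
    for score, ref_group, imp_group in impostor_trials:
        if score >= threshold:
            counts[(ref_group, imp_group)] += 1
    # Normalise within each reference group so that each row sums to 1.
    row_totals = Counter()
    for (ref_group, _), count in counts.items():
        row_totals[ref_group] += count
    return {
        pair: count / row_totals[pair[0]] for pair, count in counts.items()
    }
```

Calling the function once with grade labels and once with L1 labels gives the two breakdowns, with the threshold taken from the EER operating point of the system under analysis.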
As there were only a small number of speakers graded as C1 or C2 in the BULATS test set, the two grade groups were merged into one group, C, in the following analysis. For a fair comparison, 200 speakers were randomly selected (roughly gender balanced) for each grade group from the BULATS test set, and the grade breakdown is shown in Table 3. For lower grades, impostor trials from the A1 grade group dominated FA errors, as A1 speakers tend to produce short utterances, which are more challenging for the systems. For higher grades (B2 and C), impostor trials from the C grade group constituted a larger portion of FA errors, probably because C speakers tend to produce long utterances in a more “native” way and are also similar to B2 speakers.
|Grade||Grade of Impostor Spkr.|
The numbers of speakers from different L1 groups also varied in the BULATS test set. For a fair comparison, 200 speakers were randomly selected (roughly gender balanced) for each of 6 major L1s. The L1 breakdown is shown in Table 4, where impostor trials from the same L1 group as the reference speaker generally dominated FA errors. English learners from the same L1 group tend to have similar accents when speaking English, which makes them more confusable to speaker verification systems compared to learners from a different L1 group. Particularly, impostors of Thai L1 constitute a considerable portion of FA errors for each L1, as A1 and A2 speakers dominate Thai L1 in the BULATS test set, which is different from other L1s where B1 and B2 speakers dominate.
|L1||L1 of Impostor Spkr.|
5.3 Overall system performance
Based on the analysis in the previous section, speaker attributes beyond gender, namely grade and L1, were used as additional restrictions on impostor set selection. The following forms of impostor selection were examined:
gender: impostors from the same gender group as the reference speaker, as in Section 5.1;
grade (same): impostors from the same grade group as the reference speaker;
grade (higher): impostors from higher grade groups than the reference speaker if the grade of the reference speaker is lower than C, otherwise from C; this case is of practical interest for impersonation in spoken language tests;
L1: impostors from the same L1 group as the reference speaker.
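The selection rules listed above can be expressed as a small filter. This is a sketch under stated assumptions, not the authors' tooling: speakers are represented as dictionaries with hypothetical keys `id`, `gender`, `l1` and `grade`, the grades C1 and C2 are assumed to be already merged into C, and the restriction names are illustrative labels.

```python
GRADE_ORDER = ["A1", "A2", "B1", "B2", "C"]  # C1 and C2 merged into C


def is_allowed_impostor(ref, imp, restrictions):
    """Decide whether speaker imp may act as an impostor for speaker ref.

    restrictions: a set drawn from {"gender", "l1", "grade_same",
    "grade_higher"}; combining several narrows the impostor set.
    """
    if ref["id"] == imp["id"]:
        return False  # a speaker is never an impostor for themselves
    if "gender" in restrictions and ref["gender"] != imp["gender"]:
        return False
    if "l1" in restrictions and ref["l1"] != imp["l1"]:
        return False
    ref_rank = GRADE_ORDER.index(ref["grade"])
    imp_rank = GRADE_ORDER.index(imp["grade"])
    if "grade_same" in restrictions and imp_rank != ref_rank:
        return False
    if "grade_higher" in restrictions:
        top = len(GRADE_ORDER) - 1
        if ref_rank < top and imp_rank <= ref_rank:
            return False  # impostor must be strictly better than the reference
        if ref_rank == top and imp_rank != top:
            return False  # for C references, impostors also come from C
    return True
```

For instance, passing `{"gender", "l1", "grade_higher"}` keeps only impostors of the same gender and L1 with a higher grade, the combination that corresponds most closely to a candidate recruiting a more proficient substitute.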
The total number of verification trials decreases with further restriction on impostors, as shown in Table 5. Table 6 shows the impact on EER of restricting the possible set of impostors according to gender, L1 or grade on both the BULATS and Linguaskill test sets. Because there is too little data for each individual L1 or grade, the X1 and X2 systems that are adapted or fine-tuned on the whole BULATS training set are used for verification. As expected, restricting possible impostors according to speaker attributes generally yielded higher EERs, as the percentage of impostors “close” to the reference speaker increased. Take the gender restriction as the starting point, which is the configuration used in the experiments in Section 5.1. Further restricting the set of impostors to the same L1 increased EERs, in agreement with the results shown in Table 4, as did restricting to the same grade. An interesting result in terms of handling impersonation is that, if the set of impostors is instead further restricted to higher grades, EERs decrease compared with restricting by gender alone. The highest EER for both systems was obtained when impostors were restricted jointly by gender, L1 and grade, which indicates that all of these are important speaker attributes of non-native data. The combination of same gender, same L1 and higher grade is the most closely related to practical impersonation scenarios, since a candidate is more likely to choose a substitute from the same gender and L1 group who speaks the target language better, in order to obtain a higher grade in a spoken language test.
For the impersonation scenario where the impostor trials are restricted to the same gender, same L1 and higher grade, the DET curves for all systems, including the unadapted VoxCeleb and BULATS-trained systems, are shown in Figure 3 for the BULATS test set. This allows the overall distribution of FA and MS errors for these systems to be evaluated. Compared with the fine-tuned X2 system, the PLDA-adapted X1 system had a lower MS probability when the FA probability was low and a higher MS probability when the FA probability was high. This implies that the X1 system tends to accept impostors as reference speakers, while the X2 system tends to reject reference speakers as impostors. For detecting candidate impersonation in spoken language tests, the X2 system may have a high cost, as it may incorrectly flag valid candidates for malpractice, and manual checks would then be required to confirm each such classification. In contrast, the X1 system may result in a lower level of security, because it has a higher chance of failing to identify a candidate who is impersonating another. Based on these complementary trends, a score-level linear combination of the two systems was performed, with weights of 0.7 and 0.3 for the X1 and X2 systems, respectively. The combination system gave consistently better performance than the other systems over a wide range of FA and MS probabilities, with an EER of 0.58% on the BULATS test set, as demonstrated in Figure 3. The same trend was observed with these weights on the Linguaskill test set, with an EER of 0.72% for the combination system, an approximately 8% relative reduction in EER from the X1 system. Thus, the combination of the two adapted systems, which makes use of both large-scale VoxCeleb data and in-domain BULATS data, can serve as a sensible configuration for impersonation detection in spoken language tests.
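The score-level combination amounts to a weighted sum of the two systems' scores for each trial. The sketch below uses the weights quoted above (0.7 for X1 and 0.3 for X2); the function name is hypothetical, and it assumes the two score lists are aligned trial by trial and are on comparable scales, since any score normalisation is not described in the text.

```python
def fuse_scores(x1_scores, x2_scores, w1=0.7, w2=0.3):
    """Linear score-level fusion of two systems over the same trials."""
    if len(x1_scores) != len(x2_scores):
        raise ValueError("both systems must score the same list of trials")
    return [w1 * a + w2 * b for a, b in zip(x1_scores, x2_scores)]
```

The fused scores are then thresholded in the same way as the scores of a single system, so the EER and DET curve of the combination can be computed without changing the rest of the evaluation.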
6 Conclusions
This paper has investigated malpractice in the form of candidate impersonation for spoken language assessment. This task is closely related to standard speaker verification, but is applied to the domain of non-native speech. Advanced neural network based speaker verification systems were built on both limited non-native spoken English data from the BULATS test and a large standard corpus, VoxCeleb. For the configuration used, all systems yielded relatively low EERs of less than 1%. Though built with only limited data, the systems trained on just BULATS data outperformed the “out-of-the-box” VoxCeleb-based system. However, by adapting both the PLDA model and the deep speaker representation, the VoxCeleb-based systems could yield lower EERs. The attributes of the “impostors” were then analysed in terms of both the impostor’s grade and L1. As expected, L1 was the most important attribute of the impostor selected, though the grade also influenced performance. In the most likely impersonation scenario, in which impostors are restricted to the same gender, the same L1 and a higher grade group, the combination of the two adapted systems gave consistently better performance over a wide range of FA and MS probabilities, making it a sensible configuration for impersonation detection.