Are pre-trained text representations useful for multilingual and multi-dimensional language proficiency modeling?

Abstract

Development of language proficiency models for non-native learners has been an active area of interest in NLP research for the past few years. Although language proficiency is multidimensional in nature, existing research typically considers a single "overall proficiency" while building models. Further, existing approaches also consider only one language at a time. This paper describes our experiments and observations about the role of pre-trained and fine-tuned multilingual embeddings in performing multi-dimensional, multilingual language proficiency classification. We report experiments with three languages (German, Italian, and Czech) and model seven dimensions of proficiency ranging from vocabulary control to sociolinguistic appropriateness. Our results indicate that while fine-tuned embeddings are useful for multilingual proficiency modeling, none of the features achieves consistently best performance across all dimensions of language proficiency. All code, data, and related supplementary material can be found at: https://github.com/nishkalavallabhi/MultidimCEFRScoring.
