Leveraging Open-Source Large Language Models for Native Language Identification

Abstract

Native Language Identification (NLI), the task of identifying the native language (L1) of a person based on their writing in a second language (L2), has applications in forensics, marketing, and second language acquisition. Historically, conventional machine learning approaches that rely heavily on extensive feature engineering have outperformed transformer-based language models on this task. Recently, closed-source generative large language models (LLMs), e.g., GPT-4, have demonstrated remarkable performance on NLI in a zero-shot setting, including promising results in open-set classification. However, closed-source LLMs have many disadvantages, such as high costs and the undisclosed nature of their training data. This study explores the potential of using open-source LLMs for NLI. Our results indicate that open-source LLMs do not reach the accuracy levels of closed-source LLMs when used out-of-the-box. However, when fine-tuned on labeled training data, open-source LLMs can achieve performance comparable to that of commercial LLMs.
