Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages

Abstract

Language Identification (LI) is crucial for various natural language processing tasks, serving as a foundational step in applications such as sentiment analysis, machine translation, and information retrieval. In multilingual societies like India, particularly among young people on social media, text often exhibits code-mixing, blending local languages with English at different linguistic levels. This phenomenon presents formidable challenges for LI systems, especially when languages intermingle within single words. Dravidian languages, prevalent in southern India, possess rich morphological structures yet are under-represented on digital platforms, leading to the adoption of Roman or hybrid scripts for communication. This paper introduces a prompt-based method for a shared task on word-level LI in Dravidian languages. We use GPT-3.5 Turbo to test whether a large language model can correctly classify words into the correct categories. Our findings show that the Kannada model consistently outperformed the Tamil model across most metrics, indicating higher accuracy and reliability in identifying and categorizing Kannada language instances. In contrast, the Tamil model showed moderate performance, with precision and recall in particular needing improvement.
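
To make the idea of prompt-based word-level LI concrete, the following is a minimal sketch of how a prompt might be assembled and a model reply parsed. It is not the prompt used in this work: the label set, the prompt wording, the reply format, and the function names `build_prompt` and `parse_response` are all assumptions for illustration, and the call to the model itself is left out.

```python
from typing import List

# Hypothetical label set; the shared task's actual tag inventory is not stated here.
LABELS = ["kannada", "english", "mixed", "name", "other"]


def build_prompt(words: List[str], language: str, labels: List[str] = LABELS) -> str:
    """Build a zero-shot prompt that asks for one label per numbered word."""
    numbered = "\n".join(f"{i}. {w}" for i, w in enumerate(words, start=1))
    return (
        f"Each word below comes from code-mixed {language}-English text "
        "written in Roman script.\n"
        f"Assign every word exactly one label from: {', '.join(labels)}.\n"
        "Answer with one line per word in the form 'index. word -> label'.\n\n"
        f"{numbered}"
    )


def parse_response(text: str, n_words: int, labels: List[str] = LABELS,
                   fallback: str = "other") -> List[str]:
    """Return one label per word position; unparseable lines keep `fallback`."""
    result = [fallback] * n_words
    for line in text.splitlines():
        if "->" not in line:
            continue
        left, right = line.rsplit("->", 1)
        index_part = left.split(".", 1)[0].strip()
        label = right.strip().lower()
        # Accept the line only if the index and the label are both valid.
        if index_part.isdecimal() and label in labels:
            i = int(index_part) - 1
            if 0 <= i < n_words:
                result[i] = label
    return result


# Illustrative input only; the reply string stands in for a model response.
words = ["nanu", "office", "hogtini"]
prompt = build_prompt(words, "Kannada")
reply = "1. nanu -> kannada\n2. office -> English\n3. hogtini -> unknown"
predicted = parse_response(reply, len(words))
```

Parsing by position rather than by word keeps repeated words distinct, and mapping any out-of-vocabulary label to a fallback keeps the output aligned with the input even when the model deviates from the requested format.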
