Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

Abstract

While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems of generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing. For instance, ChatGPT generates fluent and natural texts in Singlish (an English-based creole spoken in Singapore), but for the English-Tamil language pair, the system mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.
