Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

  • 2023-09-07 04:20:41
  • Zheng-Xin Yong, Ruochen Zhang, Jessica Zosa Forde, Skyler Wang, Samuel Cahyawijaya, Holy Lovenia, Genta Indra Winata, Lintang Sutawika, Jan Christian Blaise Cruz, Long Phan, Yin Lin Tan, Thamar Solorio, Alham Fikri Aji

Abstract

While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing. For instance, ChatGPT generates fluent and natural Singlish texts (an English-based creole spoken in Singapore), but for English-Tamil language pair, the system mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.

 
