Developing and Utilizing a Large-Scale Cantonese Dataset for Multi-Tasking in Large Language Models

Abstract

High-quality data resources play a crucial role in training large language models (LLMs), particularly for low-resource languages like Cantonese. Despite having more than 85 million native speakers, Cantonese is still considered a low-resource language in the field of natural language processing (NLP) due to factors such as the dominance of Mandarin, lack of cohesion within the Cantonese-speaking community, diversity in character encoding and input methods, and the tendency of overseas Cantonese speakers to prefer using English. In addition, the rich colloquial vocabulary of Cantonese, English loanwords, and code-switching characteristics add to the complexity of corpus collection and processing. To address these challenges, we collect Cantonese texts from a variety of sources, including open-source corpora, Hong Kong-specific forums, Wikipedia, and Common Crawl data. We conduct rigorous data processing through language filtering, quality filtering, content filtering, and de-duplication steps, successfully constructing a high-quality Cantonese corpus of over 2 billion tokens for training large language models. We further refine the model through supervised fine-tuning (SFT) on curated Cantonese tasks, enhancing its ability to handle specific applications. Upon completion of training, the model achieves state-of-the-art (SOTA) performance on four Cantonese benchmarks. After training on our dataset, the model also exhibits improved performance on other mainstream language tasks.
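
The four processing steps named above (language filtering, quality filtering, content filtering, and de-duplication) can be illustrated with a minimal sketch. This is not the pipeline used to build the corpus; the marker-character heuristic, the thresholds, the blocklist, and all function names are hypothetical choices made only to show how such stages might be chained.

```python
import hashlib
import re

# Assumption: a handful of characters common in written Cantonese but rare in
# Standard Written Chinese, used here as a crude language signal.
CANTONESE_MARKERS = set("嘅咗喺唔佢哋嚟啲咁嘢冇睇")

# Assumption: placeholder blocklist; a real content filter would be far richer.
BLOCKED_TERMS = {"spam-term"}


def normalize(text: str) -> str:
    """Collapse whitespace runs so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip()


def is_cantonese(text: str, min_ratio: float = 0.02) -> bool:
    """Language filter: share of CJK characters that are Cantonese markers."""
    cjk = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    if not cjk:
        return False
    hits = sum(ch in CANTONESE_MARKERS for ch in cjk)
    return hits / len(cjk) >= min_ratio


def passes_quality(text: str, min_chars: int = 10) -> bool:
    """Quality filter: drop very short fragments (hypothetical threshold)."""
    return len(text) >= min_chars


def passes_content(text: str) -> bool:
    """Content filter: drop documents containing any blocked term."""
    return not any(term in text for term in BLOCKED_TERMS)


def build_corpus(docs):
    """Apply the four stages in order; de-duplication is exact-match by hash."""
    seen = set()
    kept = []
    for doc in docs:
        text = normalize(doc)
        if not (is_cantonese(text) and passes_quality(text) and passes_content(text)):
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Calling `build_corpus` on a list of raw strings returns the surviving documents in their original order. Exact hashing only catches identical copies after whitespace normalization; near-duplicate detection would need a different technique.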
