Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien

Abstract

In natural language processing (NLP), code-mixing (CM) is a challenging task, especially when the mixed languages include dialects. In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants, and it is also common in Taiwan. However, dialects such as Hokkien often suffer from scarce resources and lack an official writing system, which limits the development of dialect CM research. In this paper, we propose a method for constructing a Hokkien-Mandarin CM dataset that mitigates this limitation, overcomes the morphological issue within the Sino-Tibetan language family, and offers an efficient Hokkien word segmentation method through a linguistics-based toolkit. Furthermore, we use our proposed dataset and employ transfer learning to train XLM (a cross-lingual language model) for translation tasks. To fit the code-mixing scenario, we adapt XLM slightly. We found that by using linguistic knowledge, rules, and language tags, the model produces good results on CM data translation while maintaining monolingual translation quality.
