Abstract
In natural language processing (NLP), code-mixing (CM) is a challenging task, especially when the mixed languages include dialects. In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants, and it is also common in Taiwan. However, dialects such as Hokkien often suffer from scarce resources and lack an official writing system, which limits the development of dialect CM research. In this paper, we propose a method for constructing a Hokkien-Mandarin CM dataset to mitigate this limitation, overcome the morphological issue within the Sino-Tibetan language family, and offer an efficient Hokkien word segmentation method through a linguistics-based toolkit. Furthermore, we use our proposed dataset and employ transfer learning to train XLM (the cross-lingual language model) for translation tasks. To fit the code-mixing scenario, we adapt XLM slightly. We find that by using linguistic knowledge, rules, and language tags, the model produces good results on CM data translation while maintaining monolingual translation quality.