Improving Indonesian Text Classification Using Multilingual Language Model

Abstract

Compared to English, the amount of labeled data for Indonesian text classification tasks is very small. Recently developed multilingual language models have shown their ability to create multilingual representations effectively. This paper investigates the effect of combining English and Indonesian data on building Indonesian text classification models (e.g., sentiment analysis and hate speech detection) using multilingual language models. Using the feature-based approach, we observe performance across various Indonesian data sizes and amounts of added English data. The experiments show that adding English data improves performance, especially when the amount of Indonesian data is small. Using the fine-tuning approach, we further show the effectiveness of utilizing English data to build Indonesian text classification models.
