Evaluating Language Model Finetuning Techniques for Low-resource Languages

Abstract

Unlike mainstream languages (such as English and French), low-resource languages often suffer from a lack of expert-annotated corpora and benchmark resources, which makes it hard to apply state-of-the-art techniques directly. In this paper, we alleviate this scarcity problem for the low-resourced Filipino language in two ways. First, we introduce a new benchmark language modeling dataset in Filipino which we call WikiText-TL-39. Second, we show that language model finetuning techniques such as BERT and ULMFiT can be used to consistently train robust classifiers in low-resource settings, experiencing at most a 0.0782 increase in validation error when the number of training examples is decreased from 10K to 1K while finetuning using a privately-held sentiment dataset.
