The merits of Universal Language Model Fine-tuning for Small Datasets -- a case with Dutch book reviews

Abstract

We evaluated the effectiveness of using language models that were pre-trained in one domain as the basis for a classification model in another domain: Dutch book reviews. Pre-trained language models have opened up new possibilities for classification tasks with limited labelled data, because representations can be learned in an unsupervised fashion. In our experiments we studied the effect of training set size (100-1600 items) on the prediction accuracy of a ULMFiT classifier based on a language model that we pre-trained on the Dutch Wikipedia. We also compared ULMFiT to Support Vector Machines (SVM), which are traditionally considered suitable for small collections. We found that ULMFiT outperforms SVM for all training set sizes and that satisfactory results (~90% accuracy) can be achieved using training sets that can be manually annotated within a few hours. We deliver both our new benchmark collection of Dutch book reviews for sentiment classification and the pre-trained Dutch language model to the community.
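As an illustration of the training-set-size experiment described above, the following is a minimal sketch of how training subsets of increasing size might be drawn from a labelled pool. It is not the authors' code: the function name `nested_training_subsets`, the doubling schedule of sizes, the placeholder review pool, and the choice to nest smaller subsets inside larger ones are all assumptions made for the example. The abstract only states the 100-1600 range.

```python
import random


def nested_training_subsets(items, sizes, seed=0):
    """Return {size: subset}; each smaller subset is a prefix of the larger ones.

    Nesting the subsets is an assumption of this sketch, chosen so that
    differences between sizes are not confounded by resampling.
    """
    if max(sizes) > len(items):
        raise ValueError("largest subset size exceeds the pool size")
    rng = random.Random(seed)   # fixed seed so the draw is reproducible
    shuffled = list(items)      # copy, so the caller's pool is left untouched
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(sizes)}


# Hypothetical pool of labelled reviews as (text, label) pairs.
pool = [(f"review {i}", i % 2) for i in range(2000)]

# Assumed doubling schedule within the 100-1600 range from the abstract.
sizes = [100, 200, 400, 800, 1600]

subsets = nested_training_subsets(pool, sizes)
for n in sizes:
    print(n, len(subsets[n]))
```

Each subset would then be used to train both classifiers, with accuracy measured on the same held-out test set, so that the resulting curves are directly comparable.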
