Give your Text Representation Models some Love: the Case for Basque

  • 2020-04-02 11:46:52
  • Rodrigo Agerri, Iñaki San Vicente, Jon Ander Campos, Ander Barrena, Xabier Saralegi, Aitor Soroa, Eneko Agirre

Abstract

Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their own. This is suboptimal because, for many languages, the models have been trained on smaller (or lower-quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work sets a new state of the art in those tasks for Basque. All benchmarks and models used in this work are publicly available.

 
