gaBERT -- an Irish Language Model

Abstract

The BERT family of neural language models have become highly popular due totheir ability to provide sequences of text with rich context-sensitive tokenencodings which are able to generalise well to many Natural Language Processingtasks. Over 120 monolingual BERT models covering over 50 languages have beenreleased, as well as a multilingual model trained on 104 languages. Weintroduce, gaBERT, a monolingual BERT model for the Irish language. We compareour gaBERT model to multilingual BERT and show that gaBERT provides betterrepresentations for a downstream parsing task. We also show how differentfiltering criteria, vocabulary size and the choice of subword tokenisationmodel affect downstream performance. We release gaBERT and related code to thecommunity.

Quick Read (beta)

loading the full paper ...