gaBERT -- an Irish Language Model

  • 2021-07-27 16:38:53
  • James Barry, Joachim Wagner, Lauren Cassidy, Alan Cowap, Teresa Lynn, Abigail Walsh, Mícheál J. Ó Meachair, Jennifer Foster


The BERT family of neural language models has become highly popular due to its ability to provide sequences of text with rich context-sensitive token encodings that generalise well to many Natural Language Processing tasks. Over 120 monolingual BERT models covering over 50 languages have been released, as well as a multilingual model trained on 104 languages. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance. We release gaBERT and related code to the community.


