Small Languages, Big Models: A Study of Continual Training on Languages of Norway

Abstract

Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Sámi. To address this issue, we present a novel three-stage continual training approach. We also experiment with combining causal and masked language modeling to get more flexible models. Based on our findings, we train, evaluate, and openly release a new large generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
