Small Languages, Big Models: A Study of Continual Training on Languages of Norway

Abstract

Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
