Small Languages, Big Models: A Study of Continual Training on Languages of Norway

  • 2025-02-02 23:58:48
  • David Samuel, Vladislav Mikhailov, Erik Velldal, Lilja Øvrelid, Lucas Georges Gabriel Charpentier, Andrey Kutuzov, Stephan Oepen

Abstract

Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.

 
