Poro 34B and the Blessing of Multilinguality

Abstract

The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing: when the lack of training data is a constraint for effectively training larger models for a target language, augmenting the dataset with other languages can offer a way to improve over the capabilities of monolingual models for that language. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish and excels in translation, while also achieving competitive performance in its class for English and programming languages. We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.
