LLMic: Romanian Foundation Language Model

Abstract

Recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, with commercial models leading the way. While open models usually operate at a smaller scale, they maintain competitiveness through specialization and fine-tuning. However, a significant challenge persists: open models often underperform in low-resource languages due to limited representation in the training corpus. In this paper, we present LLMic, a bilingual foundation language model designed specifically for the Romanian language. We document the complete process of pretraining a foundation model for a low-resource language, including corpus construction, architecture selection, and hyper-parameter optimization. Our evaluation demonstrates that LLMic can be specialized for tasks in the target language, achieving results comparable to other much larger open models. We show that fine-tuning LLMic for language translation after the initial pretraining phase outperforms existing solutions in English-to-Romanian translation tasks. This opens the path for efficient large-scale processing for the Romanian language community, using the much smaller LLMic model.
