Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities

Abstract

Training large language models (LLMs) in low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce DictaLM2.0 and DictaLM2.0-Instruct, two LLMs derived from the Mistral model, trained on a substantial corpus of approximately 200 billion tokens in both Hebrew and English. Adapting a pre-trained model to a new language involves specialized techniques that differ significantly from training a model from scratch or further training existing models on well-resourced languages such as English. We outline these novel training methodologies, which facilitate effective learning and adaptation to the linguistic properties of Hebrew. Additionally, we fine-tuned DictaLM2.0-Instruct on a comprehensive instruct dataset to enhance its performance on task-specific instructions. To rigorously evaluate our models, we introduce a new benchmark suite for Hebrew LLM evaluation, covering a diverse set of tasks including Question Answering, Sentiment Analysis, Winograd Schema Challenge, Translation, and Summarization. Our work not only addresses the intricacies of training LLMs in low-resource languages but also proposes a framework that can be leveraged for adapting other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.
