Vikhr: Constructing a State-of-the-art Bilingual Open-Source Instruction-Following Large Language Model for Russian

Abstract

There has been a surge in developing various Large Language Models (LLMs). However, text generation for languages other than English often faces significant challenges, including poor generation quality and reduced computational performance due to the disproportionate representation of tokens in the model's vocabulary. In this work, we address these issues by developing a pipeline for adapting English-oriented pre-trained models to other languages and constructing efficient bilingual LLMs. Using this pipeline, we construct Vikhr, a state-of-the-art bilingual open-source instruction-following LLM designed specifically for the Russian language. "Vikhr" refers to the name of the Mistral LLM series and means a "strong gust of wind." Unlike previous Russian-language models that typically rely on LoRA adapters on top of English-oriented models, sacrificing performance for lower training costs, Vikhr features an adapted tokenizer vocabulary and undergoes continued pre-training and instruction tuning of all weights. This not only enhances the model's performance but also significantly improves its computational and contextual efficiency. The remarkable performance of Vikhr across various Russian-language benchmarks can also be attributed to our efforts in expanding instruction datasets and corpora for continued pre-training. Vikhr not only sets a new state of the art among open-source LLMs for Russian but even outperforms some proprietary closed-source models on certain benchmarks. The model weights, instruction sets, and code are publicly available.
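
To make the vocabulary argument concrete, the following toy sketch shows why an English-oriented vocabulary can be costly for Russian text. It is not the tokenizer or pipeline used for Vikhr: the vocabularies, the greedy longest-match rule, and the helper names `tokenize` and `tokens_per_word` are all assumptions made for illustration. Real LLM tokenizers typically use BPE or Unigram models, but many share the same byte-level fallback for characters missing from the vocabulary.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization with UTF-8 byte fallback (toy model)."""
    tokens = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # No entry matched: emit one token per UTF-8 byte of the character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


def tokens_per_word(text, vocab):
    """Average number of tokens spent per whitespace-separated word."""
    return len(tokenize(text, vocab)) / len(text.split())


# Hypothetical vocabularies: one English-only, one extended with Russian words.
english_vocab = {"strong", "gust", "of", "wind", " "}
adapted_vocab = english_vocab | {"сильный", "порыв", "ветра"}

english = "strong gust of wind"
russian = "сильный порыв ветра"

print(tokens_per_word(english, english_vocab))  # English text, English vocabulary
print(tokens_per_word(russian, english_vocab))  # Russian text, English vocabulary
print(tokens_per_word(russian, adapted_vocab))  # Russian text, adapted vocabulary
```

In this toy setting, each Cyrillic character falls back to two byte tokens under the English-only vocabulary, so the Russian phrase consumes far more tokens than its English counterpart. Adding Russian entries to the vocabulary brings the count back in line, which is the intuition behind the computational and contextual efficiency gains the abstract describes.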
