Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters

Abstract

Multilingual translation remains a challenging task for large language models (LLMs), which must handle intricate language patterns and avoid the stilted output that arises in automated translation. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models, pushing the limits of translation capability at a 7B parameter size. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate by Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share the best practices learned through our optimization process and make the parameters publicly available to advance translation research and applications.
