Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A study on Lithuanian History

Abstract

In this work, we evaluated the Lithuanian and general history knowledge of multilingual Large Language Models (LLMs) on a multiple-choice question-answering task. The models were tested on a dataset of Lithuanian national and general history questions translated into Baltic, Nordic, and other languages (English, Ukrainian, Arabic) to assess knowledge sharing across culturally and historically connected groups. We evaluated GPT-4o, LLaMa3.1 8b and 70b, QWEN2.5 7b and 72b, Mistral Nemo 12b, LLaMa3 8b, Mistral 7b, LLaMa3.2 3b, and Nordic fine-tuned models (GPT-SW3 and LLaMa3 8b). Our results show that GPT-4o consistently outperformed all other models across language groups, with slightly better results for Baltic and Nordic languages. Larger open-source models like QWEN2.5 72b and LLaMa3.1 70b performed well but showed weaker alignment with Baltic languages. Smaller models (Mistral Nemo 12b, LLaMa3.2 3b, QWEN 7B, LLaMa3.1 8B, and LLaMa3 8b) showed gaps in Lithuanian-related (LT) alignment with Baltic languages while performing better on Nordic and other languages. The Nordic fine-tuned models did not surpass multilingual models, indicating that shared cultural or historical context alone does not guarantee better performance.
