Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis

Abstract

Bengali is an underrepresented language in NLP research, and it remains a challenge due to its unique linguistic structure and computational constraints. In this work, we systematically investigate the challenges that hinder Bengali NLP performance by focusing on the absence of standardized evaluation benchmarks. We then evaluated 10 recent open-source Large Language Models (LLMs) on 8 of the translated datasets and performed a comprehensive error analysis to pinpoint their primary failure modes. Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families like Mistral. We also identified promising robustness in certain architectures, such as DeepSeek, that maintain more stable performance across languages. Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy: models tend to perform worse when inputs are excessively tokenized, whereas more efficient and concise tokenization results in improved performance. These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. This work will catalyze further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide. The code and dataset used in this research are publicly available at https://github.com/BengaliAI/bn-llm-benchmark.
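
To make the notion of "excessive tokenization" concrete, the sketch below computes tokens per word, one common proxy for tokenization efficiency. It is only an illustration under stated assumptions: the abstract does not specify the metric used in the paper, `byte_tokenize` is a hypothetical stand-in that emits one token per UTF-8 byte rather than any evaluated model's tokenizer, and the two sample sentences are illustrative, not drawn from the benchmark.

```python
from typing import Callable, List


def byte_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte.

    Real LLM tokenizers (BPE, SentencePiece, ...) behave differently;
    this is only an assumption used to keep the example self-contained.
    """
    return list(text.encode("utf-8"))


def tokens_per_word(text: str, tokenize: Callable[[str], List[int]]) -> float:
    """Average number of tokens per whitespace-separated word.

    Higher values mean the text is split into more pieces,
    i.e. it is tokenized less efficiently.
    """
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


# Illustrative sentences with the same meaning ("I eat rice").
english = "I eat rice"
bengali = "আমি ভাত খাই"

en = tokens_per_word(english, byte_tokenize)
bn = tokens_per_word(bengali, byte_tokenize)

print(f"English tokens per word: {en:.2f}")
print(f"Bengali tokens per word: {bn:.2f}")
```

With this byte-level stand-in, the Bengali sentence needs more tokens per word than the English one because each Bengali code point occupies three bytes in UTF-8. Swapping `byte_tokenize` for a model's actual tokenizer would give the per-model figure that can then be compared against accuracy.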
