BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP

Abstract

Large Language Models (LLMs) have emerged as one of the most important breakthroughs in NLP for their impressive skills in language generation and other language-specific tasks. Though LLMs have been evaluated in various tasks, mostly in English, they have not yet undergone thorough evaluation in under-resourced languages such as Bengali (Bangla). To this end, this paper introduces BenLLM-Eval, a comprehensive evaluation of LLMs that benchmarks their performance in Bengali, a language with modest resources. In this regard, we select various important and diverse Bengali NLP tasks, such as text summarization, question answering, paraphrasing, natural language inference, transliteration, text classification, and sentiment analysis, for zero-shot evaluation of popular LLMs, namely GPT-3.5, LLaMA-2-13b-chat, and Claude-2. Our experimental results demonstrate that while in some Bengali NLP tasks zero-shot LLMs could achieve performance on par with, or even better than, current SOTA fine-tuned models, in most tasks their performance is quite poor in comparison to the current SOTA results (with open-source LLMs like LLaMA-2-13b-chat performing particularly poorly). These findings call for further efforts to develop a better understanding of LLMs in modest-resourced languages like Bengali.
