NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian

  • 2023-12-03 08:09:45
  • Peng Liu, Lemei Zhang, Terje Nissen Farup, Even W. Lauvrak, Jon Espen Ingvaldsen, Simen Eide, Jon Atle Gulla, Zhirong Yang

Abstract

Recent advancements in Generative Language Models (GLMs) have transformed Natural Language Processing (NLP) by showcasing the effectiveness of the "pre-train, prompt, and predict" paradigm in utilizing pre-trained GLM knowledge for diverse applications. Despite their potential, these capabilities lack adequate quantitative characterization due to the absence of comprehensive benchmarks, particularly for low-resource languages. Existing low-resource benchmarks focus on discriminative language models like BERT, neglecting the evaluation of generative language models. Moreover, current benchmarks often overlook measuring generalization performance across multiple tasks, a crucial metric for GLMs. To bridge these gaps, we introduce NLEBench, a comprehensive benchmark tailored for evaluating natural language generation capabilities in Norwegian, a low-resource language. We use Norwegian as a case study to explore whether current GLMs and benchmarks in mainstream languages like English can reveal the unique characteristics of underrepresented languages. NLEBench encompasses a suite of real-world NLP tasks ranging from news storytelling, summarization, open-domain conversation, natural language understanding, instruction fine-tuning, toxicity and bias evaluation, to self-curated Chain-of-Thought investigation. It features two high-quality, human-annotated datasets: an instruction dataset covering traditional Norwegian cultures, idioms, slang, and special expressions, and a document-grounded multi-label dataset for topic classification, question answering, and summarization. This paper also introduces foundational Norwegian Generative Language Models (NorGLMs) developed with diverse parameter scales and Transformer-based architectures. Systematic evaluations on the proposed benchmark suite provide insights into the capabilities and scalability of NorGLMs across various downstream tasks.

 
