Abstract
Recent advancements in Generative Language Models (GLMs) have transformed Natural Language Processing (NLP) by showcasing the effectiveness of the "pre-train, prompt, and predict" paradigm in utilizing pre-trained GLM knowledge for diverse applications. Despite their potential, these capabilities lack adequate quantitative characterization due to the absence of comprehensive benchmarks, particularly for low-resource languages. Existing low-resource benchmarks focus on discriminative language models like BERT, neglecting the evaluation of generative language models. Moreover, current benchmarks often overlook measuring generalization performance across multiple tasks, a crucial metric for GLMs. To bridge these gaps, we introduce NLEBench, a comprehensive benchmark tailored for evaluating natural language generation capabilities in Norwegian, a low-resource language. We use Norwegian as a case study to explore whether current GLMs and benchmarks in mainstream languages like English can reveal the unique characteristics of underrepresented languages. NLEBench encompasses a suite of real-world NLP tasks ranging from news storytelling, summarization, open-domain conversation, natural language understanding, instruction fine-tuning, toxicity and bias evaluation, to self-curated Chain-of-Thought investigation. It features two high-quality, human-annotated datasets: an instruction dataset covering traditional Norwegian cultures, idioms, slang, and special expressions, and a document-grounded multi-label dataset for topic classification, question answering, and summarization. This paper also introduces foundational Norwegian Generative Language Models (NorGLMs) developed with diverse parameter scales and Transformer-based architectures. Systematic evaluations on the proposed benchmark suite provide insights into the capabilities and scalability of NorGLMs across various downstream tasks.