MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

Abstract

Recently, there has been rapid advancement in research on Large Language Models (LLMs), resulting in significant progress on several Natural Language Processing (NLP) tasks. Consequently, there has been a surge in LLM evaluation research to comprehend the models' capabilities and limitations. However, much of this research has been confined to the English language, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced, necessitating their evaluation on non-English languages. This study aims to expand our MEGA benchmarking suite by including six new datasets to form the MEGAVERSE benchmark. The benchmark comprises 22 datasets covering 81 languages, including low-resource African languages. We evaluate several state-of-the-art LLMs such as GPT-3.5-Turbo, GPT4, PaLM2, and Llama2 on the MEGAVERSE datasets. Additionally, we include two multimodal datasets in the benchmark and assess the performance of the LLaVa-v1.5 model. Our experiments suggest that GPT4 and PaLM2 outperform the Llama models on various tasks, notably on low-resource languages, with GPT4 outperforming PaLM2 on more datasets than vice versa. However, issues such as data contamination must be addressed to obtain an accurate assessment of LLM performance on non-English languages.
