3LM: Bridging Arabic, STEM, and Code through Benchmarking

Abstract

Arabic is one of the most widely spoken languages in the world, yet efforts to develop and evaluate Large Language Models (LLMs) for Arabic remain relatively limited. Most existing Arabic benchmarks focus on linguistic, cultural, or religious content, leaving a significant gap in domains like STEM and code, which are increasingly relevant for real-world LLM applications. To help bridge this gap, we present 3LM, a suite of three benchmarks designed specifically for Arabic. The first is a set of STEM-related question-answer pairs, naturally sourced from Arabic textbooks and educational worksheets. The second consists of synthetically generated STEM questions, created using the same sources. The third benchmark focuses on code generation, built through a careful translation of two widely used code benchmarks, incorporating a human-in-the-loop process with several rounds of review to ensure high-quality and faithful translations. We release all three benchmarks publicly to support the growth of Arabic LLM research in these essential but underrepresented areas.
