Abstract
The rapid development of Large Language Models (LLMs) and the emergence of novel abilities with scale have necessitated the construction of holistic, diverse and challenging benchmarks such as HELM and BIG-bench. However, most of these benchmarks currently focus only on performance in English, and evaluations that include Southeast Asian (SEA) languages are few in number. We therefore propose BHASA, a holistic linguistic and cultural evaluation suite for LLMs in SEA languages. It comprises three components: (1) an NLP benchmark covering eight tasks across Natural Language Understanding (NLU), Generation (NLG) and Reasoning (NLR), (2) LINDSEA, a linguistic diagnostic toolkit that spans the gamut of linguistic phenomena including syntax, semantics and pragmatics, and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity. For this preliminary effort, we implement the NLP benchmark only for Indonesian, Vietnamese, Thai and Tamil, and we only include Indonesian and Tamil for LINDSEA and the cultural diagnostics dataset. As GPT-4 is purportedly one of the best-performing multilingual LLMs at the moment, we use it as a yardstick to gauge the capabilities of LLMs in the context of SEA languages. Our initial experiments on GPT-4 with BHASA find it lacking in various aspects of linguistic capabilities, cultural representation and sensitivity in the targeted SEA languages. BHASA is a work in progress and will continue to be improved and expanded in the future.