Abstract
The language technology moonshot moment of Generative Large Language Models (GLLMs) was not limited to English: these models brought a surge of technological applications, investments, and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were, until recently, difficult to verify beyond qualitative demonstrations due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate \emph{Danoliteracy}, a measure of Danish language and cultural competency, across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark was found to produce a robust ranking that correlates with human feedback at $\rho \sim 0.8$, with GPT-4 and Claude Opus models achieving the highest rankings. Analyzing these model results across scenarios, we find one strong underlying factor explaining $95\%$ of scenario performance variance for GLLMs in Danish, suggesting a $g$ factor of model consistency in language adaptation.