Abstract
Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric. We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics. Human evaluations confirm that our data generation method results in high accuracy of both translations and gender labels across languages. We use EuroGEST to evaluate 24 multilingual language models from six model families, demonstrating that the strongest stereotypes in all models across all languages are that women are \textit{beautiful}, \textit{empathetic} and \textit{neat} and men are \textit{leaders}, \textit{strong, tough} and \textit{professional}. We also show that larger models encode gendered stereotypes more strongly and that instruction finetuning does not consistently reduce gendered stereotypes. Our work highlights the need for more multilingual studies of fairness in LLMs and offers scalable methods and resources to audit gender bias across languages.