Abstract
We introduce EuroParlVote, a novel benchmark for evaluating large language models (LLMs) in politically sensitive contexts. It links European Parliament debate speeches to roll-call vote outcomes and includes rich demographic metadata for each Member of the European Parliament (MEP), such as gender, age, country, and political group. Using EuroParlVote, we evaluate state-of-the-art LLMs on two tasks -- gender classification and vote prediction -- revealing consistent patterns of bias. We find that LLMs frequently misclassify female MEPs as male and demonstrate reduced accuracy when simulating votes for female speakers. Politically, LLMs tend to favor centrist groups while underperforming on both far-left and far-right ones. Proprietary models like GPT-4o outperform open-weight alternatives in terms of both robustness and fairness. We release the EuroParlVote dataset, code, and demo to support future research on fairness and accountability in NLP within political contexts.