Do They Understand Them? An Updated Evaluation on Nonbinary Pronoun Handling in Large Language Models

Abstract

Large language models (LLMs) are increasingly deployed in sensitive contexts where fairness and inclusivity are critical. Pronoun usage, especially concerning gender-neutral and neopronouns, remains a key challenge for responsible AI. Prior work, such as the MISGENDERED benchmark, revealed significant limitations in earlier LLMs' handling of inclusive pronouns, but was constrained to outdated models and limited evaluations. In this study, we introduce MISGENDERED+, an extended and updated benchmark for evaluating LLMs' pronoun fidelity. We benchmark five representative LLMs, GPT-4o, Claude 4, DeepSeek-V3, Qwen Turbo, and Qwen2.5, across zero-shot, few-shot, and gender identity inference. Our results show notable improvements compared with previous studies, especially in binary and gender-neutral pronoun accuracy. However, accuracy on neopronouns and reverse inference tasks remains inconsistent, underscoring persistent gaps in identity-sensitive reasoning. We discuss implications, model-specific observations, and avenues for future inclusive AI research.
