Abstract
While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it is yet unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-sourced models, spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-sourced benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (https://github.com/brucelyu17/SC-TC-Bench).