Abstract
Large Language Models (LLMs) are widely used in Automated Essay Scoring (AES) due to their ability to capture semantic meaning. Traditional fine-tuning approaches required technical expertise, limiting accessibility for educators with limited technical backgrounds. However, prompt-based tools like ChatGPT have made AES more accessible, enabling educators to obtain machine-generated scores using natural-language prompts (i.e., the prompt-based paradigm). Despite these advancements, prior studies have shown bias in fine-tuned LLMs, particularly against disadvantaged groups. It remains unclear whether such biases persist or are amplified in the prompt-based paradigm with cutting-edge tools. Since such biases are believed to stem from the demographic information embedded in pre-trained models (i.e., the ability of LLMs' text embeddings to predict demographic attributes), this study explores the relationship between the model's ability to predict students' demographic attributes from their written work and its predictive bias in the scoring task in the prompt-based paradigm. Using a publicly available dataset of over 25,000 students' argumentative essays, we designed prompts to elicit demographic inferences (i.e., gender, first-language background) from GPT-4o and assessed fairness in automated scoring. We then conducted multivariate regression analysis to explore the impact of the model's ability to predict demographics on its scoring outcomes. Our findings revealed that (i) prompt-based LLMs can, to some extent, infer students' demographics, particularly their first-language backgrounds, from their essays; (ii) scoring biases are more pronounced when the LLM correctly predicts students' first-language background than when it does not; and (iii) scoring error for non-native English speakers increases when the LLM correctly identifies them as non-native.