Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses

Abstract

Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known response biases are poorly understood. This paper investigates the response robustness of LLMs in normative survey contexts: we test nine diverse LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of 11 perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated interviews. In doing so, we not only reveal LLMs' vulnerabilities to perturbations but also show that all tested models exhibit a consistent recency bias of varying intensity, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. By applying a set of perturbations, we reveal that LLMs partially align with survey response biases identified in humans. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.
