Abstract
The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs' performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.