From Measurement Instruments to Data: Leveraging Theory-Driven Synthetic Training Data for Classifying Social Constructs

  • 2024-10-17 09:28:45
  • Lukas Birkenmaier, Matthias Roth, Indira Sen


Computational text classification is a challenging task, especially for multi-dimensional social constructs. Recently, there has been increasing discussion that synthetic training data could enhance classification by offering examples of how these constructs are represented in texts. In this paper, we systematically examine the potential of theory-driven synthetic training data for improving the measurement of social constructs. In particular, we explore how researchers can transfer established knowledge from measurement instruments in the social sciences, such as survey scales or annotation codebooks, into theory-driven generation of synthetic data. Using two studies on measuring sexism and political topics, we assess the added value of synthetic training data for fine-tuning text classification models. Although the results of the sexism study were less promising, our findings demonstrate that synthetic data can be highly effective in reducing the need for labeled data in political topic classification. With only a minimal drop in performance, synthetic data allows for substituting large amounts of labeled data. Furthermore, theory-driven synthetic data performed markedly better than data generated without conceptual information in mind.

