HICode: Hierarchical Inductive Coding with LLMs

  • 2025-09-22 16:07:11
  • Mian Zhong, Pristina Wang, Anjalie Field

Abstract

Despite numerous applications for fine-grained corpus analysis, researchers continue to rely on manual labeling, which does not scale, or statistical tools like topic modeling, which are difficult to control. We propose that LLMs have the potential to scale the nuanced analyses that researchers typically conduct manually to large text corpora. To this effect, inspired by qualitative research methods, we develop HICode, a two-part pipeline that first inductively generates labels directly from analysis data and then hierarchically clusters them to surface emergent themes. We validate this approach across three diverse datasets by measuring alignment with human-constructed themes and demonstrating its robustness through automated and human evaluations. Finally, we conduct a case study of litigation documents related to the ongoing opioid crisis in the U.S., revealing aggressive marketing strategies employed by pharmaceutical companies and demonstrating HICode's potential for facilitating nuanced analyses in large-scale data.
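
To make the two-part shape of the pipeline concrete, the following is a minimal sketch in Python. It is not the authors' implementation. The abstract does not say how labels are generated or clustered, so everything here is an assumption for illustration: the function names (`generate_labels`, `cluster_labels`, `toy_labeler`), the toy segments, the similarity threshold, and the use of token-overlap (Jaccard) similarity with single-linkage agglomerative merging as a stand-in for whatever clustering HICode actually performs. The labeler is a lookup table standing in for an LLM call so that the sketch runs offline.

```python
from itertools import combinations
from typing import Callable


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two labels, from 0.0 to 1.0."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def generate_labels(segments: list[str],
                    labeler: Callable[[str], list[str]]) -> list[str]:
    """Stage 1 (assumed shape): collect de-duplicated labels per segment."""
    seen: dict[str, None] = {}  # dict preserves first-seen order
    for seg in segments:
        for label in labeler(seg):
            seen.setdefault(label.strip().lower(), None)
    return list(seen)


def cluster_labels(labels: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Stage 2 (stand-in): single-linkage agglomerative merging.

    Repeatedly merges the most similar pair of clusters until no pair
    reaches the (hypothetical) similarity threshold.
    """
    clusters = [[label] for label in labels]
    while True:
        best, pair = threshold, None
        for i, j in combinations(range(len(clusters)), 2):
            sim = max(jaccard(a, b) for a in clusters[i] for b in clusters[j])
            if sim >= best:
                best, pair = sim, (i, j)
        if pair is None:
            return clusters
        i, j = pair  # i < j, so popping j leaves index i valid
        clusters[i] += clusters.pop(j)


def toy_labeler(segment: str) -> list[str]:
    """Placeholder for an LLM call: labels come from a toy lookup table."""
    table = {
        "seg-a": ["pricing complaints"],
        "seg-b": ["complaints about pricing"],
        "seg-c": ["shipping delays"],
    }
    return table[segment]


labels = generate_labels(["seg-a", "seg-b", "seg-c"], toy_labeler)
themes = cluster_labels(labels)
print(themes)
```

In this toy run the two pricing labels share two of three distinct tokens and merge into one theme, while the shipping label stays on its own. A real pipeline would replace `toy_labeler` with a prompted model and would likely need a semantic rather than lexical notion of label similarity.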

 
