Development and Validation of the Provider Documentation Summarization Quality Instrument for Large Language Models

  • 2025-01-15 17:47:57
  • Emma Croxford, Yanjun Gao, Nicholas Pellegrino, Karen K. Wong, Graham Wills, Elliot First, Miranda Schnier, Kyle Burton, Cris G. Ebby, Jillian Gorskic, Matthew Kalscheur, Samy Khalil, Marie Pisani, Tyler Rubeor, Peter Stetson, Frank Liao, Cherodeep Goswami, Brian Patterson, Majid Afshar

Abstract

As Large Language Models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation. Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson correlation for substantive validity, factor analysis and Cronbach's alpha for structural validity, inter-rater reliability (ICC and Krippendorff's alpha) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Seven physician raters evaluated 779 summaries and answered 8,329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach's alpha = 0.879; 95% CI: 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI: 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct (rho = -0.200, p = 0.029) and Organized (rho = -0.190, p = 0.037). Discriminant validity distinguished high- from low-quality summaries (p < 0.001). The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer integration of LLMs into healthcare workflows.
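
For readers unfamiliar with the internal-consistency statistic reported above, the following is a minimal sketch of the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). It is not the authors' analysis code: the function name `cronbach_alpha` and the toy ratings are hypothetical, and the sketch assumes a complete ratings matrix with one row per rated summary and one column per instrument item.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per rated summary).

    Each row holds the k item scores for that summary. Assumes no
    missing values; uses sample variances (n - 1 denominator).
    """
    n_rows = len(scores)
    k = len(scores[0])
    if k < 2 or n_rows < 2:
        raise ValueError("need at least 2 items and 2 rows")
    if any(len(row) != k for row in scores):
        raise ValueError("all rows must have the same number of items")

    # Variance of each item (column) across summaries.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Variance of the per-summary total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")

    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical toy ratings on a 1-5 scale (3 items, 4 summaries).
toy_ratings = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [3, 2, 3],
]
alpha = cronbach_alpha(toy_ratings)
```

Values closer to 1 indicate that the items vary together across summaries; the confidence interval quoted in the abstract would require an additional step, such as bootstrapping, that this sketch does not cover.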

 
