Ranking Over Scoring: Towards Reliable and Robust Automated Evaluation of LLM-Generated Medical Explanatory Arguments

Abstract

Evaluating LLM-generated text has become a key challenge, especially in domain-specific contexts like the medical field. This work introduces a novel evaluation methodology for LLM-generated medical explanatory arguments, relying on Proxy Tasks and rankings to closely align results with human evaluation criteria, overcoming the biases typically seen in LLMs used as judges. We demonstrate that the proposed evaluators are robust against adversarial attacks, including the assessment of non-argumentative text. Additionally, the human-crafted arguments needed to train the evaluators are minimized to just one example per Proxy Task. By examining multiple LLM-generated arguments, we establish a methodology for determining whether a Proxy Task is suitable for evaluating LLM-generated medical explanatory arguments, requiring only five examples and two human experts.
