Are Large Language Models Reliable Argument Quality Annotators?

Abstract

Evaluating the quality of arguments is a crucial aspect of any system leveraging argument mining. However, it is a challenge to obtain reliable and consistent annotations regarding argument quality, as this usually requires domain-specific expertise of the annotators. Even among experts, the assessment of argument quality is often inconsistent due to the inherent subjectivity of this task. In this paper, we study the potential of using state-of-the-art large language models (LLMs) as proxies for argument quality annotators. To assess the capability of LLMs in this regard, we analyze the agreement between model, human expert, and human novice annotators based on an established taxonomy of argument quality dimensions. Our findings highlight that LLMs can produce consistent annotations, with a moderately high agreement with human experts across most of the quality dimensions. Moreover, we show that using LLMs as additional annotators can significantly improve the agreement between annotators. These results suggest that LLMs can serve as a valuable tool for automated argument quality assessment, thus streamlining and accelerating the evaluation of large argument datasets.
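
To make the notion of agreement between annotator groups concrete, the following is a minimal sketch of a pairwise agreement computation. The abstract does not name the agreement statistic used in the paper; Cohen's kappa is assumed here purely for illustration, and the function name `cohen_kappa`, the 1-3 rating scale, and all scores are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two annotators who rated the same items.

    Assumption: kappa stands in for whichever agreement statistic
    the paper actually reports; labels are treated as nominal.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Fraction of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores for six arguments on one quality dimension
# (illustrative data only, not taken from the paper).
expert = [3, 2, 1, 3, 2, 1]
novice = [3, 1, 1, 2, 2, 1]
model = [3, 2, 1, 3, 1, 1]

for name, other in [("novice", novice), ("model", model)]:
    print(f"expert vs {name}: kappa = {cohen_kappa(expert, other):.3f}")
```

Kappa treats the scores as unordered categories, so a disagreement of one point counts the same as a disagreement of two. A weighted or ordinal statistic may be more appropriate for graded quality scores.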
