A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

Abstract

Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which analyze and extract argument semantics more efficiently than traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLMs, but there is still a lack of research and results on how these models perform on publicly available argument classification databases. This paper presents a study of a selection of LLMs, using diverse datasets such as Args.me and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thoughts algorithm. The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks. Among models with reasoning capabilities, Deepseek-R1 performs best. However, despite their superiority, GPT-4o and Deepseek-R1 still make errors. The most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLMs and prompt algorithms. The work also shows some weaknesses of known prompt algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.
