Abstract
Leveraging outputs from multiple large language models (LLMs) is emerging as a method for harnessing their power across a wide range of tasks while mitigating their capacity for making errors, e.g., hallucinations. However, current approaches to combining insights from multiple LLMs often involve unstructured interactions (e.g., free debate), resulting in model generations that are not faithfully justifiable. In this work, we introduce MArgE, a novel framework to provide formal structure to the evidence from each LLM, in the form of a tree of extracted arguments, for the task of claim verification. We use a variant of Argumentative LLMs (ArgLLMs), i.e., LLMs driven by frameworks and semantics from the field of computational argumentation, to construct structured argument trees for given claims. This process creates an inspectable pathway from the initial arguments to the final claim verification decisions, providing a faithful justification thereof. We show experimentally that MArgE can significantly outperform single LLMs, including three open-source models (4B to 8B parameters), GPT-4o-mini and existing ArgLLMs, as well as prior methods for unstructured multi-LLM debates. We thus demonstrate the advantages of incorporating formal, argumentative reasoning mechanisms when combining multiple LLM outputs.