Abstract
Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers. In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, under more realistic scenarios, RAG-based methods for the generation of verdicts - i.e., short texts discussing the veracity of a claim - evaluating them on stylistically complex claims and heterogeneous, yet reliable, knowledge bases. Our findings show a complex landscape, where, for example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases; larger models excel in verdict faithfulness, while smaller models provide better context adherence, with human evaluations favouring zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.