Abstract
Large Language Models (LLMs) have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. To address this gap, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world, high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling, such as fallacy-type imbalance and labor-intensive annotation, we introduce SmartyPat, an automated framework powered by logic programming-based oracles. SmartyPat uses Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal nuanced insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.