Abstract
Recent advances in large language models (LLMs) have shown that they can answer questions requiring complex reasoning. However, their ability to identify and respond to text containing logical fallacies or deliberately misleading premises remains less studied. To address this gap, we introduce RuozhiBench, a bilingual dataset comprising 677 carefully curated questions that contain various forms of deceptive reasoning, meticulously crafted through extensive human effort and expert review. In a comprehensive evaluation of 17 LLMs from 5 series over RuozhiBench using both open-ended and two-choice formats, we conduct extensive analyses on evaluation protocols and result patterns. Despite their high scores on conventional benchmarks, these models showed limited ability to detect and reason correctly about logical fallacies, with even the best-performing model, Claude-3-haiku, achieving only 62% accuracy, compared to human accuracy of more than 90%.