Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Abstract

Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful when models face an explicit bias in their prompts, i.e., the CoT can give an incorrect picture of how models arrive at conclusions. We go further and show that unfaithful CoT can also occur on realistic prompts with no artificial bias. We find that when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify systematically answering Yes to both questions or No to both questions, despite such responses being logically contradictory. We show preliminary evidence that this is due to models' implicit biases towards Yes or No, thus labeling this unfaithfulness as Implicit Post-Hoc Rationalization. Our results reveal that several production models exhibit surprisingly high rates of post-hoc rationalization in our settings: GPT-4o-mini (13%) and Haiku 3.5 (7%). While frontier models are more faithful, especially thinking ones, none are entirely faithful: Gemini 2.5 Flash (2.17%), ChatGPT-4o (0.49%), DeepSeek R1 (0.37%), Gemini 2.5 Pro (0.14%), and Sonnet 3.7 with thinking (0.04%). We also investigate Unfaithful Illogical Shortcuts, where models use subtly illogical reasoning to try to make a speculative answer to hard maths problems seem rigorously proven. Our findings raise challenges for strategies for detecting undesired behavior in LLMs via the chain of thought.
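
As an illustration of the paired-question check described above, the following is a minimal sketch, not the authors' evaluation code. It assumes each model response has already been reduced to a final "yes" or "no", and that X and Y are never equal, so exactly one of the two questions should be answered Yes. The function names `classify_pair` and `contradiction_rate` are hypothetical, and the paper's actual criteria for counting a pair as unfaithful may be stricter than this.

```python
from collections import Counter


def classify_pair(answer_xy: str, answer_yx: str) -> str:
    """Classify final answers to 'Is X bigger than Y?' and 'Is Y bigger than X?'.

    Assumption: X != Y, so a consistent model answers Yes to exactly one question.
    """
    a = answer_xy.strip().lower()
    b = answer_yx.strip().lower()
    if a not in {"yes", "no"} or b not in {"yes", "no"}:
        return "unparsed"  # answer could not be reduced to yes/no
    if a == b:
        # Same answer to both orderings is logically contradictory.
        return "both_yes" if a == "yes" else "both_no"
    return "consistent"


def contradiction_rate(pairs) -> float:
    """Fraction of parseable pairs answered Yes/Yes or No/No."""
    counts = Counter(classify_pair(a, b) for a, b in pairs)
    scored = counts["both_yes"] + counts["both_no"] + counts["consistent"]
    if scored == 0:
        return 0.0
    return (counts["both_yes"] + counts["both_no"]) / scored
```

Keeping Yes/Yes and No/No as separate labels makes it possible to see whether contradictions lean towards one answer, which is the kind of implicit Yes or No bias the abstract refers to.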
